Allow only (English & Arabic) in UTF-8 code
I am looking for a regex to change every character that is not English or Arabic into an underscore "_".
Currently I have the following code, which works, but I think I've got the wrong Unicode
range, as it allows Chinese and other languages I don't need in my script.
$title=~tr/[a-z0-9_\x7f-\xff]/_/cd;
Any help would be appreciated
If you're seeing bytes between \x7f and \xff, your application is probably working with UTF-8 bytes, not Unicode characters. Read perldoc perlunicode, then decode() your strings before trying to work with them on this level.
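To illustrate the difference between bytes and characters, here is a small sketch using the core Encode module. The sample byte string is made up for the example (it is the UTF-8 encoding of an Arabic word); in a real script the bytes would come from a file, a form, or a database.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative input: raw UTF-8 bytes for U+0633 U+0644 U+0627 U+0645.
my $bytes = "\xD8\xB3\xD9\x84\xD8\xA7\xD9\x85";

# Before decoding, Perl sees eight separate bytes in the \x80-\xff range.
my $byte_count = length $bytes;

# After decoding, the same data is four Unicode characters.
my $title = decode('UTF-8', $bytes);
my $char_count = length $title;

# Encode again before writing the string back out.
my $out = encode('UTF-8', $title);
```

Any character class or Unicode property should be applied to the decoded string, not to the raw bytes.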
Once that's done, you should be able to search for English and Arabic characters with something like:
/[\p{ASCII}\p{Arabic}]/
See perldoc perluniprops for other Unicode properties you can use.
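Putting the two steps together, a minimal sketch could look like the following. The sub name sanitize_title is hypothetical, and it assumes the input and output are both UTF-8 byte strings.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 bytes, returns UTF-8 bytes.
sub sanitize_title {
    my ($bytes) = @_;

    # Work on characters, not bytes.
    my $title = decode('UTF-8', $bytes);

    # Replace every character that is neither ASCII nor Arabic script.
    $title =~ s/[^\p{ASCII}\p{Arabic}]/_/g;

    return encode('UTF-8', $title);
}
```

Note that \p{ASCII} is broader than the a-z0-9_ set in the question: it also keeps uppercase letters, spaces, and ASCII punctuation. If only the original set is wanted, a class such as [^a-z0-9_\p{Arabic}] should do it.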
The range of the Arabic (Indic) digits is: \x{0660}-\x{0669}
The range of the Arabic letters is: \x{0621}-\x{063A}\x{0641}-\x{064A}
The range of the Arabic vowels including "Tatweel" is: \x{0640}\x{064B}-\x{0652}
The range of the Arabic punctuation is: \x{060C}\x{060D}\x{061B}-\x{061F}\x{2E2E}\x{066A}-\x{066D}
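If you prefer explicit ranges over Unicode properties, the ranges above can be combined into one negated character class. This is only a sketch: the sample $title is made up, and it assumes the string has already been decoded into characters.

```perl
use strict;
use warnings;

# Illustrative, already-decoded input: ASCII, a hyphen, Arabic letters,
# an Arabic-Indic digit, and one Chinese character.
my $title = "abc-\x{0633}\x{0644}\x{0627}\x{0645}\x{0661}\x{4E2D}";

# Single quotes keep the \x{...} escapes literal until the regex is compiled.
my $keep = join '',
    'a-z0-9_',
    '\x{0660}-\x{0669}',                     # Arabic-Indic digits
    '\x{0621}-\x{063A}\x{0641}-\x{064A}',    # letters
    '\x{0640}\x{064B}-\x{0652}',             # tatweel and vowel marks
    '\x{060C}\x{060D}\x{061B}-\x{061F}\x{2E2E}\x{066A}-\x{066D}';  # punctuation

# Everything outside the allowed set becomes an underscore.
(my $clean = $title) =~ s/[^${keep}]/_/g;
```

A substitution is used here rather than tr///, since tr/// treats square brackets as literal characters instead of as a character class.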