Allow only English and Arabic characters in UTF-8 text

I am looking for a regex to change all characters that are not English and/or Arabic into an underscore "_".

Currently I have the following code, which works, but I think I've got the wrong Unicode range, as it allows Chinese and other languages I don't need in my script.

$title=~tr/[a-z0-9_\x7f-\xff]/_/cd;

Any help would be appreciated


If you're seeing bytes between \x7f and \xff, your application is probably working with UTF-8 bytes, not Unicode characters. Read perldoc perlunicode, then decode() your strings before trying to work with them on this level.

Once that's done, you should be able to search for English and Arabic characters with something like:

/[\p{ASCII}\p{Arabic}]/

See perldoc perluniprops for other Unicode properties you can use.
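As a rough sketch of the decode-then-match approach described above, the following uses the core Encode module and a negated character class to replace everything that is neither ASCII nor Arabic script. The sample byte string and variable names are illustrative assumptions, not taken from the question. Note that \p{ASCII} also admits spaces and ASCII punctuation, which is broader than the a-z0-9_ set in the original code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: raw UTF-8 bytes, as they might arrive from a file or form.
# Contains "abc", a space, four Arabic letters, a space, and two Chinese characters.
my $bytes = "abc \xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a \xe4\xb8\xad\xe6\x96\x87";

# Turn the bytes into Perl characters before doing any matching.
my $title = decode('UTF-8', $bytes);

# Replace every character that is neither ASCII nor in the Arabic script.
$title =~ s/[^\p{ASCII}\p{Arabic}]/_/g;

# Encode back to UTF-8 bytes before printing or storing the result.
my $out = encode('UTF-8', $title);
```

Here the two Chinese characters each become an underscore, while the ASCII and Arabic characters pass through unchanged.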


The range of the Arabic (Indic) digits is: \x{0660}-\x{0669}

The range of the Arabic letters is: \x{0621}-\x{063A}\x{0641}-\x{064A}

The range of the Arabic vowels including "Tatweel" is: \x{0640}\x{064B}-\x{0652}

The range of the Arabic punctuation is: \x{060C}\x{060D}\x{061B}-\x{061F}\x{2E2E}\x{066A}-\x{066D}
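If explicit ranges are preferred over Unicode properties, the ranges listed above could be combined into one negated character class, roughly as follows. This is only a sketch: sanitize_title is a hypothetical helper name, the a-z0-9_ part is carried over from the question's code, and the input is assumed to be an already-decoded character string.

```perl
use strict;
use warnings;

# Negated class built from the ranges listed above (digits, letters,
# vowels with Tatweel, punctuation) plus a-z, 0-9 and underscore.
my $not_allowed = qr/[^a-z0-9_\x{0660}-\x{0669}\x{0621}-\x{063A}\x{0641}-\x{064A}\x{0640}\x{064B}-\x{0652}\x{060C}\x{060D}\x{061B}-\x{061F}\x{2E2E}\x{066A}-\x{066D}]/;

# Hypothetical helper: replace every character outside the allowed set.
sub sanitize_title {
    my ($title) = @_;    # expects decoded characters, not raw UTF-8 bytes
    $title =~ s/$not_allowed/_/g;
    return $title;
}
```

Unlike the \p{ASCII} approach, this class does not allow spaces, uppercase letters or ASCII punctuation, so those are replaced with underscores as well.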
