Excluding some characters from a Regular Expression range

I have a regex that selects words in a Unicode range:

[\u0D80-\u0DFF]*

I want to exclude words that include a certain character, for example \u0D92.

How should I change the expression?


.NET supports a dedicated notation for character class subtraction:

[\u0D80-\u0DFF-[\u0D92]]*

Example (using the .NET engine): http://regexstorm.net/tester


Just build two ranges; that is, leave a gap in your range for each value you wish to exclude:

[\u0D80-\u0D91\u0D93-\u0DFF]*
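A minimal Java sketch of the split-range approach, in case it helps to see it run. The class name SplitRangeExample and the helper accepts are made up for illustration, and the sample words are arbitrary code points from the Sinhala block. It assumes you want to reject a whole word, so it uses matches() rather than find().

```java
import java.util.regex.Pattern;

// Hypothetical example class; the names are illustrative only.
final class SplitRangeExample {
    // U+0D80..U+0DFF with a gap left at U+0D92.
    static final Pattern WORD =
        Pattern.compile("[\\u0D80-\\u0D91\\u0D93-\\u0DFF]*");

    // True only if every character of the word is in the allowed set.
    static boolean accepts(String word) {
        return WORD.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("\u0D85\u0DB8")); // true: no U+0D92 present
        System.out.println(accepts("\u0D85\u0D92")); // false: contains U+0D92
    }
}
```

Note that with find() instead of matches(), the same pattern would match the pieces of a word on either side of the excluded character.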


In Java, you can subtract characters from a character class by intersecting it with a negated class:

[\u0D80-\u0DFF&&[^\u0D92]]*

[a-z&&[^egi]] matches all characters from a to z except e, g and i.
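A short Java sketch of both intersection patterns above. The class and method names are hypothetical, and the test strings are arbitrary; the && operator is specific to engines that support class intersection, such as java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical example class; the names are illustrative only.
final class IntersectionExample {
    // The Sinhala block intersected with "anything except U+0D92".
    static final Pattern WORD =
        Pattern.compile("[\\u0D80-\\u0DFF&&[^\\u0D92]]*");

    // a to z, except e, g and i.
    static final Pattern LETTER = Pattern.compile("[a-z&&[^egi]]");

    static boolean acceptsWord(String word) {
        return WORD.matcher(word).matches();
    }

    static boolean acceptsLetter(String letter) {
        return LETTER.matcher(letter).matches();
    }

    public static void main(String[] args) {
        System.out.println(acceptsWord("\u0D85\u0DB8")); // true
        System.out.println(acceptsWord("\u0D85\u0D92")); // false
        System.out.println(acceptsLetter("a"));          // true
        System.out.println(acceptsLetter("e"));          // false
    }
}
```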


Use a negative lookahead to implement set subtraction (that is, intersection with the complement):

(?x)(?:
     (?!\x{d92})
     [\x{d80}-\x{dff}]
)

That creates an atom that fits your criteria. Qualify at will.
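As a rough illustration, here is the atom above quantified with * in Java 7 or later, which accepts the \x{...} form. The class name and the accepts helper are invented for this sketch, and using matches() to reject whole words is an assumption about the intended use.

```java
import java.util.regex.Pattern;

// Hypothetical example class; the names are illustrative only.
final class LookaheadSubtractionExample {
    // (?x) turns on comments mode, so whitespace in the pattern is ignored.
    // The group is one atom: a code point in U+0D80..U+0DFF that is not U+0D92.
    static final Pattern WORD = Pattern.compile(
        "(?x)(?: (?!\\x{d92}) [\\x{d80}-\\x{dff}] )*");

    static boolean accepts(String word) {
        return WORD.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("\u0D85\u0DB8")); // true: no U+0D92 present
        System.out.println(accepts("\u0D85\u0D92")); // false: U+0D92 is rejected
    }
}
```

The lookahead form works in most engines that lack a subtraction or intersection syntax, at the cost of being more verbose.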

I don't trust your \uXXXX notation. It is always a bad sign when you see something that uses it, because it is some ancient Unicode 1 legacy notation that assumes Plane 0 only. That means it hasn't been useful since Unicode 2, way back deep into the previous millennium. I would avoid it if at all possible, because you don't want to get into bad habits that don't work for 16/17ths of the Unicode namespace.

I have therefore used the standard \x{...} notation used in Java 7, ICU, and Perl, which is not bigoted against Planes 1-16 of Unicode. Indeed, in the languages accursed with a UTF-16 representation (yes, Java, I'm looking at you), that is the only possible way to do non-BMP ranges.
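To make the non-BMP point concrete, here is a small Java sketch of a range written with \x{...} whose endpoints lie outside Plane 0. The range U+1F600 to U+1F64F and the names NonBmpRangeExample and inRange are chosen purely for illustration and do not come from the question.

```java
import java.util.regex.Pattern;

// Hypothetical example class; the names and the range are illustrative only.
final class NonBmpRangeExample {
    // Both endpoints are supplementary code points (Plane 1).
    static final Pattern IN_RANGE =
        Pattern.compile("[\\x{1F600}-\\x{1F64F}]");

    // Builds a one-code-point string and tests it against the range.
    static boolean inRange(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return IN_RANGE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(inRange(0x1F600)); // true: inside the range
        System.out.println(inRange('A'));     // false: a BMP letter
    }
}
```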
