Excluding some characters from a Regular Expression range
I have a regex that selects words in a Unicode range:
[\u0D80-\u0DFF]*
I want to exclude words that include a certain character, for example \u0D92.
How should I change the expression?
.NET supports its own notation for this, called character class subtraction:
[\u0D80-\u0DFF-[\u0D92]]*
Example (using the .NET engine): http://regexstorm.net/tester
Just build two ranges; that is, leave a gap in your range for each value you wish to exclude:
[\u0D80-\u0D91\u0D93-\u0DFF]*
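As a minimal sketch of the split-range approach in Java (the class name SplitRangeDemo, the method isAllowedWord, and the sample strings are made up for illustration and are not from the question), the pattern can be checked like this:

```java
import java.util.regex.Pattern;

final class SplitRangeDemo {
    // U+0D80..U+0DFF with a one-character gap left at U+0D92.
    // The doubled backslash hands the escape to the regex engine
    // instead of the Java compiler.
    static final Pattern WORD =
        Pattern.compile("[\\u0D80-\\u0D91\\u0D93-\\u0DFF]*");

    // True when every character of s is in the range and none is U+0D92.
    static boolean isAllowedWord(String s) {
        return WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowedWord("\u0D85\u0D86")); // true
        System.out.println(isAllowedWord("\u0D85\u0D92")); // false
    }
}
```

This form has no engine-specific syntax, so it should port to most regex flavors that understand the \u escape.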
In engines that support class intersection with &&, such as Java, you can subtract characters from a character class by intersecting it with a negated class:
[\u0D80-\u0DFF&&[^\u0D92]]*
[a-z&&[^egi]] matches all characters from a to z except e, g and i.
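Here is a small Java sketch of both intersection patterns above. The class name IntersectionDemo, the helper names, and the test strings are assumptions chosen for the example.

```java
import java.util.regex.Pattern;

final class IntersectionDemo {
    // The Sinhala range intersected with "anything but U+0D92".
    static final Pattern WORD =
        Pattern.compile("[\\u0D80-\\u0DFF&&[^\\u0D92]]*");

    // Lowercase ASCII letters minus e, g and i.
    static final Pattern LETTER = Pattern.compile("[a-z&&[^egi]]");

    static boolean isAllowedWord(String s) {
        return WORD.matcher(s).matches();
    }

    static boolean isAllowedLetter(String s) {
        return LETTER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowedWord("\u0D85\u0D86")); // true
        System.out.println(isAllowedWord("\u0D85\u0D92")); // false
        System.out.println(isAllowedLetter("a"));          // true
        System.out.println(isAllowedLetter("g"));          // false
    }
}
```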
Use lookaheads to implement set intersection, here with the complement of the excluded character:
(?x)(?:
(?!\x{d92})
[\x{d80}-\x{dff}]
)
That creates an atom that fits your criteria. Quantify it at will, for example by appending *.
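The lookahead atom can be tried in Java 7 or later, which accepts the \x{...} escape. This is only a sketch: the class name LookaheadDemo, the method name, and the trailing * quantifier are assumptions added for the example.

```java
import java.util.regex.Pattern;

final class LookaheadDemo {
    // Each repetition first asserts "the next character is not U+0D92",
    // then consumes one character from U+0D80..U+0DFF.
    static final Pattern WORD =
        Pattern.compile("(?:(?!\\x{d92})[\\x{d80}-\\x{dff}])*");

    static boolean isAllowedWord(String s) {
        return WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowedWord("\u0D85\u0D86")); // true
        System.out.println(isAllowedWord("\u0D85\u0D92")); // false
    }
}
```

The lookahead form is more verbose than a split range, but it also works in flavors that have neither class subtraction nor && intersection.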
I don't trust your \uXXXX notation. It is always a bad sign when you see something that uses it, because it is an ancient Unicode 1 legacy notation that assumes Plane 0 only. That means it hasn't been useful since Unicode 2, way back deep in the previous millennium. I would avoid it if at all possible, because you don't want to get into bad habits that don't work for 16/17ths of the Unicode namespace.
I have therefore used the standard \x{...} notation used in Java 7, ICU, and Perl, which is not bigoted against Planes 1-16 of Unicode. Indeed, in the languages accursed with a UTF-16 representation (yes, Java, I'm looking at you), that is the only possible way to do non-BMP ranges.
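To illustrate \x{...} on a range outside Plane 0, here is a hedged Java sketch. The range U+10000..U+1007F is picked purely as an example and does not come from the question, and the names NonBmpRangeDemo and inRange are made up.

```java
import java.util.regex.Pattern;

final class NonBmpRangeDemo {
    // A range of supplementary (Plane 1) code points, written with
    // code point escapes rather than surrogate pairs.
    static final Pattern SUPPLEMENTARY =
        Pattern.compile("[\\x{10000}-\\x{1007F}]");

    // Builds a one-code-point string (two UTF-16 chars for non-BMP
    // values) and tests it against the range.
    static boolean inRange(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return SUPPLEMENTARY.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(inRange(0x10000)); // true
        System.out.println(inRange(0x10080)); // false
        System.out.println(inRange(0x0D92));  // false
    }
}
```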