Excluding some characters from a Regular Expression range

2023-03-29 05:12

I have a regex that selects words in a Unicode range:

[\u0D80-\u0DFF]*

I want to exclude words that include a certain character, for example \u0D92.

How should I change the expression?

.NET supports its own notation for this, called character class subtraction:

[\u0D80-\u0DFF-[\u0D92]]*

Example (using the .NET engine): http://regexstorm.net/tester

Just build two ranges; that is, leave a gap in your range for each value you wish to exclude:

[\u0D80-\u0D91\u0D93-\u0DFF]*
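
As a rough illustration of the two-range approach, here is a minimal Python sketch. The names SINHALA_WITHOUT_0D92 and is_allowed_word are hypothetical, and `+` is used in place of `*` (an assumption about the intent) so that the empty string does not count as a word.

```python
import re

# U+0D80..U+0DFF (the Sinhala block) with a gap left at U+0D92.
# "+" replaces "*" so that an empty string is not accepted as a word.
SINHALA_WITHOUT_0D92 = re.compile(r"[\u0D80-\u0D91\u0D93-\u0DFF]+")


def is_allowed_word(word: str) -> bool:
    """Return True if every character of `word` lies in the gapped range."""
    return SINHALA_WITHOUT_0D92.fullmatch(word) is not None


print(is_allowed_word("\u0D85\u0D86"))        # True
print(is_allowed_word("\u0D85\u0D92\u0D86"))  # False: contains U+0D92
```

Using fullmatch rather than search means a word is rejected as a whole when it contains the excluded character, instead of being matched in pieces around it.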

In engines that support character class intersection, such as Java's java.util.regex, you can subtract characters from a character class by doing

[\u0D80-\u0DFF&&[^\u0D92]]*

[a-z&&[^egi]] matches all characters from a to z except e, g and i.

Use a negative lookahead to implement the set intersection, here "in the range" and "not \x{d92}":

(?x)(?:
     (?!\x{d92})
     [\x{d80}-\x{dff}]
)

That creates an atom that fits your criteria. Qualify at will.
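
The same lookahead atom can be sketched in Python. Two assumptions apply: Python's `re` module does not accept the \x{...} notation, so this sketch falls back to \uXXXX escapes, and the names WORD and words_without_excluded are hypothetical. The atom is qualified with `+` and applied to whitespace-separated tokens.

```python
import re

# Verbose-mode pattern mirroring the (?x) layout above.
WORD = re.compile(
    r"""
    (?:
        (?!\u0D92)        # the next character must not be U+0D92 ...
        [\u0D80-\u0DFF]   # ... and must lie in U+0D80..U+0DFF
    )+                    # one or more such characters
    """,
    re.VERBOSE,
)


def words_without_excluded(text: str) -> list:
    """Keep whitespace-separated tokens made up only of allowed characters."""
    return [w for w in text.split() if WORD.fullmatch(w)]


sample = "\u0D85\u0D86 \u0D85\u0D92\u0D86 abc"
print(words_without_excluded(sample))  # only the first token survives
print(WORD.findall(sample))            # the second token is split around U+0D92
```

Note the difference between the two calls: findall simply stops at the excluded character and resumes after it, so a word containing U+0D92 comes back as fragments. Testing each token with fullmatch is what actually drops the whole word.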

I don't trust your \uXXXX notation. It is always a bad sign when you see something that uses it, because it is some ancient Unicode 1 legacy notation that assumes Plane 0 only. That means it hasn't been useful since Unicode 2, way back deep into the previous millennium. I would avoid it if at all possible, because you don't want to get into bad habits that don't work for 16/17ths of the Unicode namespace.

I have therefore used the standard \x{...} notation used in Java 7, ICU, and Perl, which is not bigoted against Planes 1-16 of Unicode. Indeed, in the languages accursed with a UTF-16 representation (yes, Java, I'm looking at you), that is the only possible way to do non-BMP ranges.
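
To make the Plane 0 limitation concrete, here is a small Python sketch. Python's `re` does not use \x{...}; its counterpart for code points beyond U+FFFF is the eight-digit \UXXXXXXXX escape, which is what this example assumes. The names BMP_ONLY and SUPPLEMENTARY are hypothetical.

```python
import re

# \uXXXX takes exactly four hex digits, so it can only name U+0000..U+FFFF.
BMP_ONLY = re.compile(r"[\u0000-\uFFFF]")

# \UXXXXXXXX takes eight hex digits and reaches Planes 1-16.
SUPPLEMENTARY = re.compile(r"[\U00010000-\U0010FFFF]")

ch = "\U0001F600"  # a single code point outside the BMP

print(BMP_ONLY.fullmatch(ch) is None)           # True: out of reach of \uXXXX
print(SUPPLEMENTARY.fullmatch(ch) is not None)  # True
```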
