Subtract from character class

2023-01-28 13:51

Is there a way to subtract characters or a character range from another character class?

I need to find a substring within a string that contains only printable characters, excluding "<" and ">".

[[:print:]] - ('<' | '>')

That's because "<" and ">" are delimiters and must not occur within the string itself.

<abc> // valid
<ab<c> // invalid
<ab\tc> // invalid (a tab is not a printable character)

In ASCII, [:print:] is equivalent to [\x20-\x7E], so if you don't want < (\x3C) and > (\x3E), you can use [\x20-\x3B\x3D\x3F-\x7E].

This matches a run of printable characters in a string, excluding < and >:

/[\x20-\x3B\x3D\x3F-\x7E]+/
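As a minimal sketch, here is how that hex-range class could be used in Java to check the three sample strings from the question. The class name `PrintableWithoutAngles` and the method `isValid` are made up for illustration, and the surrounding `<` and `>` delimiters in the pattern are an assumption about how the substring is framed.

```java
import java.util.regex.Pattern;

class PrintableWithoutAngles {
    // Printable ASCII (\x20-\x7E) minus '<' (\x3C) and '>' (\x3E),
    // wrapped in literal '<' and '>' delimiters (an assumption for this sketch).
    static final Pattern DELIMITED =
        Pattern.compile("<[\\x20-\\x3B\\x3D\\x3F-\\x7E]+>");

    // True only if the whole string is one well-formed delimited chunk.
    static boolean isValid(String s) {
        return DELIMITED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("<abc>"));   // true
        System.out.println(isValid("<ab<c>"));  // false: '<' inside the body
        System.out.println(isValid("<ab\tc>")); // false: tab (\x09) is not printable
    }
}
```

Note that `"<ab\tc>"` in the Java source contains a real tab character; a literal backslash followed by `t` would be accepted, since both are printable.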

In Java's regular expressions, you can easily do union, intersection, and subtraction of character classes.

[a[b]]

is the union.

[a&&b]

is the intersection.

[a&&[^b]]

is the subtraction.
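A short sketch of the three operations in Java, applied to the question's case with `\p{Print}` (Java's name for the POSIX printable class). The class name `ClassSetOperations`, the field names, and the sample character sets are chosen only for illustration.

```java
import java.util.regex.Pattern;

class ClassSetOperations {
    // Union: anything in a-c or in x-z.
    static final Pattern UNION = Pattern.compile("[a-c[x-z]]");

    // Intersection: lowercase letters that are also vowels.
    static final Pattern INTERSECTION = Pattern.compile("[a-z&&[aeiou]]");

    // Subtraction: printable characters except '<' and '>',
    // framed by literal delimiters as in the question.
    static final Pattern DELIMITED = Pattern.compile("<[\\p{Print}&&[^<>]]+>");

    static boolean isValid(String s) {
        return DELIMITED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(UNION.matcher("y").matches());        // true
        System.out.println(INTERSECTION.matcher("b").matches()); // false
        System.out.println(isValid("<abc>"));                    // true
        System.out.println(isValid("<ab<c>"));                   // false
    }
}
```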

I regularly do rather complex set operations in Java. For example, this is what you have to use in Java

[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]

for a modern version of \w. (You don’t have to do that in Perl, since \w isn’t broken there the way it is in Java.) Word boundaries get a tad harder:

(?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))
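Since the same word-character class appears four times in that boundary, one way to keep it manageable in Java is to build it from a single string constant. This is only a sketch: the names `UnicodeWordBoundary`, `WORD`, `BOUNDARY`, and `containsWord` are hypothetical helpers, not part of any library.

```java
import java.util.regex.Pattern;

class UnicodeWordBoundary {
    // The word-character class used inside the lookarounds above.
    static final String WORD =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    // A boundary is a position with a word character on exactly one side.
    static final String BOUNDARY =
        "(?:(?<=" + WORD + ")(?!" + WORD + ")"
        + "|(?<!" + WORD + ")(?=" + WORD + "))";

    // True if `word` occurs in `text` as a whole word.
    static boolean containsWord(String text, String word) {
        Pattern p = Pattern.compile(BOUNDARY + Pattern.quote(word) + BOUNDARY);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("un \u00e9lan vital", "\u00e9lan")); // true
        System.out.println(containsWord("catalog", "cat"));                  // false
    }
}
```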

But at least now you have a \b that works in Java, not a broken thing that screws up everything you do. To implement \X in languages that don’t have it, you can either use a legacy grapheme cluster, defined as:

(?>\PM\pM*)
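To make the legacy definition concrete, the following sketch splits a Java string into such clusters: one non-mark character followed by any number of combining marks. `LegacyGraphemes` and `clusters` are illustrative names, not an existing API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LegacyGraphemes {
    // One non-mark character, then zero or more marks, as an atomic group.
    static final Pattern CLUSTER = Pattern.compile("(?>\\PM\\pM*)");

    // Collects each legacy grapheme cluster found in the input, in order.
    static List<String> clusters(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = CLUSTER.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // 'e' + U+0301 (combining acute accent) forms one cluster, 'a' another.
        System.out.println(clusters("e\u0301a").size()); // 2
    }
}
```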

Or you can use an extended grapheme cluster, defined as (or nearly as, actually):

(?:(?:\u000D\u000A)|(?:[\u0E40\u0E41\u0E42\u0E43\u0E44\u0EC0\u0EC1\u0EC2\u0EC3\u0EC4\uAAB5\uAAB6\uAAB9\uAABB\uAABC]*(?:[\u1100-\u115F\uA960-\uA97C]+|([\u1100-\u115F\uA960-\uA97C]*((?:[[\u1160-\u11A2\uD7B0-\uD7C6][\uAC00\uAC1C\uAC38]][\u1160-\u11A2\uD7B0-\uD7C6]*|[\uAC01\uAC02\uAC03\uAC04])[\u11A8-\u11F9\uD7CB-\uD7FB]*))|[\u11A8-\u11F9\uD7CB-\uD7FB]+|[^[\p{Zl}\p{Zp}\p{Cc}\p{Cf}&&[^\u000D\u000A\u200C\u200D]]\u000D\u000A])[[\p{Mn}\p{Me}\u200C\u200D\u0488\u0489\u20DD\u20DE\u20DF\u20E0\u20E2\u20E3\u20E4\uA670\uA671\uA672\uFF9E\uFF9F][\p{Mc}\u0E30\u0E32\u0E33\u0E45\u0EB0\u0EB2\u0EB3]]*)|(?s:.))

Of course, you don’t have to go through such extreme rewrites if you happen to be using a language with the radical notion of actually supporting its own native character set!

Unfortunately, Java is not one of those.

For regexes, I suggest using something more modern, like Perl, Python, or Ruby. Because otherwise you’re stuck in the Stone Age.
