Regexp word boundaries in non-ASCII situations

2023-02-24 14:20

I have a regular expression in my PHP script like this:

/(\b$term|$term\b)(?!([^<]+)?>)/iu

This matches the word contained in $term, as long as there's a word boundary before or after it and it's not inside an HTML tag.

However, this doesn't work in non-ASCII cases, for example with Russian text. Is there a way to make it work?

I can get almost as good a result with

/(\s$term|$term\s)(?!([^<]+)?>)/iu

but this is obviously more limited, and since this regexp is for highlighting search terms, it has the problem of including the space in the highlight.

I've read this StackOverflow question about the problem, but it doesn't help: the approach there doesn't work correctly for my case. In that example the captures are the other way around (it captures the text outside the search term, when I need to capture the search term itself).

Any way to make this work? Thanks!

You could use zero-width lookahead/lookbehind assertions to assert that the characters to the left and right of what you're matching are non-letters.
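
A minimal sketch of that idea, in case it helps. The function name highlight_term and the <mark> wrapper are made up for illustration, and the sketch treats letters, digits and underscore as "word" characters, which is an assumption about what you want. Note that it requires a non-word character on both sides of the term; the pattern in the question accepts a boundary on either side, so drop one of the two assertions if you want prefix or suffix matches.

```php
<?php
// Hypothetical helper: wraps each whole-word occurrence of $term in <mark> tags.
function highlight_term(string $html, string $term): string
{
    // Escape regex metacharacters in the term, including the "/" delimiter.
    $quoted = preg_quote($term, '/');

    // (?<![\p{L}\p{N}_]) : no letter, digit or underscore right before the term
    // (?![\p{L}\p{N}_])  : no letter, digit or underscore right after the term
    // (?![^<]*>)         : not inside an HTML tag (same idea as the lookahead in the question)
    $pattern = '/(?<![\p{L}\p{N}_])' . $quoted . '(?![\p{L}\p{N}_])(?![^<]*>)/iu';

    // preg_replace returns null on failure (e.g. invalid UTF-8); fall back to the input.
    $result = preg_replace($pattern, '<mark>$0</mark>', $html);
    return $result ?? $html;
}
```

Because the assertions are zero-width, only the term itself ends up in the match, so no surrounding space gets highlighted.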

The \b is certainly defined to work perfectly well on Unicode, as is required by UTS#18. What are you saying it is not doing? What are the exact text strings involved?
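
To answer those questions it may help to check the inputs first. One common cause of /u patterns failing is a subject or term that is not valid UTF-8 (for example Windows-1251 text). The following is only a diagnostic sketch; diagnose_boundary is a hypothetical name, and it assumes nothing about your data beyond what is in the question.

```php
<?php
// Hypothetical diagnostic: collects facts about why a /u pattern might not match.
function diagnose_boundary(string $term, string $text): array
{
    // An empty pattern with /u matches only if the subject is valid UTF-8.
    $termIsUtf8 = preg_match('//u', $term) === 1;
    $textIsUtf8 = preg_match('//u', $text) === 1;

    // The actual boundary test, with the term escaped for use inside the pattern.
    $pattern = '/\b' . preg_quote($term, '/') . '\b/iu';
    $result = @preg_match($pattern, $text);

    // Read the error code right after the call we care about.
    $error = preg_last_error();

    return [
        'term_is_utf8' => $termIsUtf8,
        'text_is_utf8' => $textIsUtf8,
        'matched'      => $result === 1,
        'error'        => $error, // compare against PREG_NO_ERROR, PREG_BAD_UTF8_ERROR, ...
    ];
}
```

If text_is_utf8 comes back false, the boundary handling is not the problem; the text needs converting to UTF-8 before matching.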
