Using preg_replace/preg_match with UTF-8 characters - specifically Māori macrons

2023-01-10 01:57

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.

For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.

I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).

Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".

The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:

preg_match('/\m(.{1})ori/i',$page_title)

Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/preg_replace see characters like "ā" and how should I construct the regex to pick them up?

Cheers Tama

Use the /u modifier to put the regex into UTF-8 mode.
On the whole, though, you're better off running iconv('utf-8','ascii//TRANSLIT',$string) on both the page name and the search term and comparing the results.
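
As a rough sketch of both suggestions (the function names titleMatchesMaori and asciiFold are made up for illustration, not part of any library): the first helper uses /u so the pattern can name ā as a code point, and the second wraps the iconv call. What //TRANSLIT produces for "ā" depends on the platform's iconv implementation and locale, so treat the second helper as something to test on your own server rather than a guaranteed result.

```php
<?php
// Hypothetical helper: true when $title contains "maori", with or without the macron.
function titleMatchesMaori(string $title): bool
{
    // /u treats pattern and subject as UTF-8; /i makes the match case-insensitive.
    // \x{0101} is U+0101 (ā).
    return preg_match('/m[a\x{0101}]ori/iu', $title) === 1;
}

// Hypothetical helper for the transliteration route. The output of //TRANSLIT
// varies between iconv implementations, and iconv() returns false on failure,
// so check the result before comparing.
function asciiFold(string $text)
{
    return iconv('UTF-8', 'ASCII//TRANSLIT', $text);
}

var_dump(titleMatchesMaori("M\u{0101}ori History")); // bool(true)
var_dump(titleMatchesMaori("Maori History"));        // bool(true)
var_dump(titleMatchesMaori("Moorings"));             // bool(false)
```

The "\u{0101}" escape in the double-quoted strings needs PHP 7 or later; typing ā directly works too as long as the source file is saved as UTF-8.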

One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's a byte string, a single dot matches only one byte, so you'd need two dots there to catch the character, or .{1,4} to cover the general case. Even then you'd have to verify that the up to four bytes you grab between the M and the o form a single valid UTF-8 character. This is all moot if PHP does Unicode right; I haven't used it in years so I can't vouch for it.
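
To make the byte-versus-character point concrete, here is a small sketch (the variable names are only for illustration). It compares the same dot pattern with and without /u against a title containing ā, which UTF-8 encodes as the two bytes C4 81.

```php
<?php
$title = "M\u{0101}ori History"; // ā is U+0101, two bytes in UTF-8

// Without /u the dot matches exactly one byte, so it cannot span ā.
$oneDotBytes = preg_match('/m.ori/i', $title);     // 0

// Two dots happen to cover both bytes of ā, but nothing checks
// that those bytes form a valid character.
$twoDotsBytes = preg_match('/m..ori/i', $title);   // 1

// With /u the dot matches one whole code point.
$oneDotUtf8 = preg_match('/m.ori/iu', $title);     // 1

// A bare dot also accepts any other character, which is why
// "Moorings" shows up in the results.
$moorings = preg_match('/m.ori/i', 'Moorings');    // 1
```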

The other issue to consider is that ā can be constructed in two ways: as a single character (U+0101), or as two Unicode code points ('a' followed by a combining diacritic from the U+0300 range, here the combining macron U+0304). You will most likely only ever get the former, but be aware that the latter is also possible.
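
One way to cover both spellings without normalizing the input first is to allow either form in the pattern. This is only a sketch, and the function name matchesMaoriAnyForm is hypothetical: it accepts the precomposed ā (U+0101), or a plain 'a' optionally followed by the combining macron (U+0304).

```php
<?php
// Hypothetical helper: matches "maori" whether the macron is precomposed,
// written as a combining mark, or missing altogether.
function matchesMaoriAnyForm(string $title): bool
{
    // \x{0101}      precomposed ā
    // a\x{0304}?    'a', optionally followed by a combining macron
    return preg_match('/m(?:\x{0101}|a\x{0304}?)ori/iu', $title) === 1;
}

var_dump(matchesMaoriAnyForm("M\u{0101}ori History"));  // bool(true), precomposed
var_dump(matchesMaoriAnyForm("Ma\u{0304}ori History")); // bool(true), combining form
var_dump(matchesMaoriAnyForm("maori"));                 // bool(true), no macron
var_dump(matchesMaoriAnyForm("Moorings"));              // bool(false)
```

If the intl extension is available, normalizing both strings to one form before comparing is an alternative to widening the pattern.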

The only language I know of that does this stuff reliably well is Perl 6, which has all kinds of insane modifiers for internationalized text in regexps.
