开发者

Using preg_replace/ preg_match with UTF-8 characters - specifically Māori macrons

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.

For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.

I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).

Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".

The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:

preg_match('/\m(.{1})ori/i',$page_title)

Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" a开发者_Python百科nd how should I construct the regex to pick them up?

Cheers Tama


  1. Use the /u modifier for utf-8 mode in regexes,
  2. You're better of on a whole with doing an iconv('utf-8','ascii//TRANSLIT',$string) on both name & search and comparing those.


One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte string option, you're going to have to do double dots there to catch it instead, or {1,4}. And even then you're going to have to verify the up to four bytes you grab between the M and the o form a singular valid UTF-8 character. This is all moot if PHP does unicode right, I haven't used it in years so I can't vouch for it.

The other issue to consider is that ā can be constructed in two ways; one as a single character (U+0101) and one as TWO unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely just only going to ever get the former, but be aware that the latter is also possible.

The only language I know of that does this stuff reliably well is Perl 6, which has all kinds on insane modifiers for internationalized text in regexps.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜