Preg_Replace and UTF8
I'm enhancing our video search page to highlight the search term(s) in the results. Because a user can enter "judas priest" while a video has "Judas Priest" in its text, I have to use regular expressions to preserve the case of the original text.
My code works, but I have problems with special characters like š, č and ž: it seems that Preg_Replace() will only match them if the case is the same (despite the /ui modifier).
My code:
$Content = Preg_Replace ( '/\b(' . $term . '?)\b/iu', '<span class="HighlightTerm">$1</span>', $Content );
I also tried this:
$Content = Mb_Eregi_Replace ( '\b(' . $term . '?)\b', '<span class="HighlightTerm">\\1</span>', $Content );
But it also doesn't work. It will match "SREČA" if the search term is "SREČA", but if the search term is "sreča" it will not match it (and vice versa).
So how do I make this work?
Update: I set the locale and internal encoding:
Mb_Internal_Encoding ( 'UTF-8' );
$loc = "UTF-8";
putenv("LANG=$loc");
$loc = setlocale(LC_ALL, $loc);
I feel really stupid right about now, but the problem wasn't with the Preg_* functions at all. For some reason I first checked whether the given term is even in the string with StriPos, and since that function is not multi-byte safe, it returned false whenever the case of the text differed from the case of the search term, so Preg_Replace wasn't even called.
So the lesson to be learned here is to always use the multi-byte versions of functions if you have UTF8 strings.
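To illustrate the difference, here is a minimal sketch comparing stripos() with mb_stripos(). The sample sentence and variable names are made up for illustration, and the file is assumed to be saved as UTF-8. On current PHP versions stripos() only folds ASCII case, so it should miss the term here, while mb_stripos() finds it.

```php
<?php
// Assumption: this source file is saved as UTF-8.
mb_internal_encoding('UTF-8');

// Hypothetical sample data, not from the original post.
$haystack = 'Danes je SREČA na naši strani';
$needle   = 'sreča';

// Byte-oriented search: "Č" and "č" are different byte sequences,
// and stripos() does not fold non-ASCII case, so this is expected to fail.
$bytePos = stripos($haystack, $needle);

// Multi-byte aware search: compares characters case-insensitively
// and returns the position in characters, not bytes.
$mbPos = mb_stripos($haystack, $needle, 0, 'UTF-8');

var_dump($bytePos);
var_dump($mbPos);
```

So a guard such as `if (mb_stripos($Content, $term) !== false)` would let the Preg_Replace call run for terms like this one.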
Not sure what your problem is stemming from, but I just put together this little test case:
<?php
$uc = "SREČA";
mb_internal_encoding('utf-8');
echo $uc."\n";
$lc = mb_strtolower($uc);
echo $lc."\n";
echo preg_replace("/\b(".preg_quote($uc).")\b/ui", "<span class='test'>$1</span>", "test:".$lc." end test");
Its output on my machine:
SREČA
sreča
test:<span class='test'>sreča</span> end test
Seems to be working properly?
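Building on that test case, here is a rough sketch of a highlight helper. The function name highlightTerm and the sample text are hypothetical. It passes the delimiter to preg_quote() so that a "/" in the search term cannot break the pattern, and it swaps \b for lookarounds on Unicode letters and digits, which is my substitution rather than anything from the original code.

```php
<?php
// Hypothetical helper for illustration; not the original poster's code.
function highlightTerm(string $content, string $term): string
{
    if ($term === '') {
        return $content; // nothing to highlight
    }

    // Escape regex metacharacters, including the '/' delimiter.
    $quoted = preg_quote($term, '/');

    // Lookarounds on Unicode letters/digits stand in for \b here.
    // The 'u' modifier treats pattern and subject as UTF-8,
    // and 'i' makes the match case-insensitive.
    $pattern = '/(?<![\p{L}\p{N}])(' . $quoted . ')(?![\p{L}\p{N}])/iu';

    $result = preg_replace($pattern, '<span class="HighlightTerm">$1</span>', $content);

    // preg_replace() returns null on error (for example, invalid UTF-8 input).
    return $result ?? $content;
}

echo highlightTerm('Video: SREČA in Judas Priest', 'sreča'), "\n";
```

Because the replacement uses $1, the original casing from the text ("SREČA") is kept in the highlighted output rather than the casing the user typed.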
If I'm not mistaken, preg_match uses the current locale. Try setting the locale to the language these characters belong to. You probably need a UTF-8 based locale too. If you have mixed languages in your page, you may be able to find a generic international locale that works.
See also: http://www.phpwact.org/php/i18n/utf-8