How to catch Marz/März/M&auml;rz?

I'm trying to find the month in a text written in German (in an HTML file).

March is written "März".

I want to be sure that I catch it, so I check for

Marz, März, M&auml;rz

I tried to use this code

if(preg_match("/ma?ä?(&auml;)?rz/i", $title))
    return 3;

It works fine for the first two, but not for &auml;. What did I do wrong?

(The HTML and my PHP files are encoded in UTF8)


Why not just try

(Marz|März|M&auml;rz)
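
A minimal sketch of how that alternation might look in PHP, assuming the title can contain the plain spelling, the UTF-8 umlaut, or the HTML entity. The function name `monthFromTitle` is hypothetical, and the `u` modifier is an addition not mentioned above; it makes PCRE treat the pattern and the subject as UTF-8 so that `ä` counts as one character.

```php
<?php
// Hypothetical helper: returns 3 when the title mentions March in any of
// the three spellings, or null when it does not.
function monthFromTitle(string $title): ?int
{
    // i: case-insensitive, u: pattern and subject are UTF-8 (assumption:
    // this file and the input are both saved as UTF-8).
    if (preg_match('/M(?:a|ä|&auml;)rz/iu', $title) === 1) {
        return 3;
    }
    return null;
}
```

Calling `monthFromTitle('Termin im M&auml;rz')` should take the entity branch, while a title without the word falls through to `null`.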


You have to first decode the entities, then use a comparison that works with the Unicode Collation Algorithm. For example, this works in Perl:

use utf8;  # the source below contains literal UTF-8 strings
use Unicode::Collate;

my $Collator = Unicode::Collate->new(normalization => undef, level => 1);
my $str = "Ich muß Perl studieren.";
my $sub = "MÜSS";
my $match;
if (my($pos,$len) = $Collator->index($str, $sub)) {
    $match = substr($str, $pos, $len);
}

Whether strings with and without diacritical marks match depends on the comparison level you choose; level 1, used above, ignores both marks and case.

How you perform basic Unicode operations like this in PHP I do not know, but I figure there must be a corresponding library, given how necessary these types of things are.


ä is encoded as more than one byte in UTF-8, so a bare ? after it only makes its last byte optional. You have to group it:

preg_match("/ma?(ä)?(&auml;)?rz/i", $title);

You can see it here.

Besides, Keng's approach is better.
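
To illustrate the byte issue, here is a small sketch that runs the ungrouped pattern with and without the `u` modifier. The `u` modifier is not part of this answer; it is shown as a possible alternative to grouping, assuming the script and the subject strings are valid UTF-8. The variable names are made up for the example.

```php
<?php
// Without u, PCRE works on bytes: "ä" is 0xC3 0xA4, so "ä?" means
// "0xC3 followed by an optional 0xA4", and 0xC3 is still required.
$bytePattern = '/ma?ä?rz/i';

// With u, "ä" is a single character and "?" applies to all of it.
$utf8Pattern = '/ma?ä?rz/iu';

foreach (['Marz', 'März'] as $word) {
    printf(
        "%s: byte mode=%d, UTF-8 mode=%d\n",
        $word,
        preg_match($bytePattern, $word),
        preg_match($utf8Pattern, $word)
    );
}
```

In byte mode the plain spelling is the one that should fail, because the first byte of `ä` is never optional.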


If it's just for searching purposes but not for returning the actual position of the word, you could normalize the search string using html_entity_decode() and iconv():

$string = html_entity_decode($string, ENT_QUOTES, "utf-8");
$string = iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", $string);

// then search for "Marz"
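
A sketch of the same normalize-then-search idea wrapped in a function. The result of `//TRANSLIT` can vary with the iconv implementation and the current locale, so this version swaps in an explicit `strtr()` map for the umlaut instead of calling `iconv()`; that substitution is an assumption of the example, not part of the suggestion above. The function name `containsMarch` is hypothetical.

```php
<?php
// Hypothetical helper: true when $html mentions "März" in any spelling.
function containsMarch(string $html): bool
{
    // Turn &auml; and similar entities into real UTF-8 characters first.
    $text = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

    // Fold the umlaut to its base letter (explicit map instead of iconv).
    $text = strtr($text, ['ä' => 'a', 'Ä' => 'A']);

    // The text is now searched with a plain, case-insensitive ASCII needle.
    return stripos($text, 'marz') !== false;
}
```

Like the original suggestion, this only answers whether the word occurs; positions in the normalized string no longer line up with the original HTML.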