
How do I define a libpcre regexp for Arabic characters?

I need to define a PCRE regexp that matches certain spam-ish words written in the Arabic/Persian alphabet, for use in the Drupal spam module. The problem is that an ordinary PCRE regexp is apparently unable to find patterns written in Arabic script.

For example, /bad word/ flags instances of 'bad word', but

/کلمه بد/i

is unable to flag 'کلمه بد'.


This works for me when I use the u (PCRE_UTF8) modifier, which makes PCRE treat both the pattern and the subject as UTF-8:

$string = 'کلمه بد';

if (preg_match('~\p{Arabic}~u', $string) > 0)
{
    var_dump('contains Arabic characters');

    if (preg_match('~کلمه بد~ui', $string) > 0)
    {
        var_dump('contains spam-ish Arabic characters');
    }
}

string(26) "contains Arabic characters"
string(35) "contains spam-ish Arabic characters"

It runs just fine on IDEOne.com too. Be sure to save your source files as UTF-8 and to convert any input data to UTF-8 before matching.
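The same idea can be wrapped in a small helper that checks a text against a list of words. This is only a minimal sketch: contains_spam_word() is a hypothetical name, not part of the Drupal spam module, and it assumes PHP 7 or later with the word list stored as UTF-8. It uses the fact that preg_match() returns false when the u modifier is given a subject that is not valid UTF-8.

```php
<?php
// Hypothetical helper for illustration; not part of the Drupal spam module.
function contains_spam_word(string $text, array $words): bool
{
    // With the u modifier, preg_match() returns false on invalid UTF-8,
    // so an empty pattern doubles as a cheap validity check.
    if (preg_match('//u', $text) !== 1) {
        return false; // input must be converted to UTF-8 first
    }

    foreach ($words as $word) {
        // Escape regex metacharacters so each word is matched literally.
        $pattern = '~' . preg_quote($word, '~') . '~ui';
        if (preg_match($pattern, $text) === 1) {
            return true;
        }
    }

    return false;
}

var_dump(contains_spam_word('این یک کلمه بد است', ['کلمه بد'])); // bool(true)
var_dump(contains_spam_word('hello world', ['کلمه بد']));        // bool(false)
var_dump(contains_spam_word("\xC8\xCF", ['کلمه بد']));           // bool(false), not valid UTF-8
```

Returning false for undecodable input is a design choice made for this sketch. If the source encoding of the data is known, converting it to UTF-8 before the check is probably the better option.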


Literal Unicode text in Perl source will only be recognized properly if the source file has use utf8; in it.

You can do /\x{644}/ and you can do

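open my $fh, '<:encoding(UTF-8)', 'somefile.txt' or die "Cannot open somefile.txt: $!";
my $bad_thing = <$fh>;
chomp $bad_thing;    # drop the trailing newline so it is not part of the pattern
/$bad_thing/;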

and either will work without the utf8 pragma, provided your data is properly decoded. But if you want to write a literal such as /ل/ directly in the source, then you need use utf8;. Make sense?
