How do I define a libpcre regexp for Arabic characters?

2023-03-05 08:55

I need to define a PCRE regexp that matches certain spam-ish words written in the Arabic/Persian alphabet, for use in the Drupal spam module. The problem is that a plain PCRE regexp is apparently unable to find patterns in Arabic script.

For example, /bad word/ flags instances of 'bad word', but

/کلمه بد/i

is unable to flag 'کلمه بد'.

I have no problem with that if I use the u (Unicode) PCRE modifier:

$string = 'کلمه بد';

if (preg_match('~\p{Arabic}~u', $string) > 0)
{
    var_dump('contains Arabic characters');

    if (preg_match('~کلمه بد~ui', $string) > 0)
    {
        var_dump('contains spam-ish Arabic characters');
    }
}

string(26) "contains Arabic characters"
string(35) "contains spam-ish Arabic characters"

It runs just fine on IDEOne.com too. Be sure to save your source files as UTF-8 and to convert any input data to UTF-8 before matching.
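As a rough sketch of how this could be wrapped up for a word list: the helper below is hypothetical (the name contains_spam_word and its parameters are my own, not part of the Drupal spam module). It assumes the input is supposed to be UTF-8 already, and rejects it when it is not rather than guessing a source encoding.

```php
<?php
// Hypothetical helper, not part of any Drupal module.
// Returns true if $text contains any of the literal words in $words.
function contains_spam_word(string $text, array $words): bool
{
    // An empty pattern with the u modifier only succeeds on valid UTF-8,
    // so this acts as a cheap encoding check that needs no extra extension.
    if (preg_match('//u', $text) !== 1) {
        return false;
    }

    foreach ($words as $word) {
        // preg_quote() escapes regex metacharacters (and the ~ delimiter)
        // so each word is matched literally.
        $pattern = '~' . preg_quote($word, '~') . '~ui';
        if (preg_match($pattern, $text) === 1) {
            return true;
        }
    }

    return false;
}
```

Called as contains_spam_word($comment, array('کلمه بد')), it should return true for any valid UTF-8 comment that contains the phrase. Treating invalid UTF-8 as "no match" is a design choice for this sketch; a real filter might prefer to flag or convert such input instead.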

Literal Unicode text in Perl source will only be recognized properly if the source file has use utf8; in it.

You can do /\x{644}/ and you can do

open my $fh, '<:utf8', 'somefile.txt' or die "blah blah";
my $bad_thing = <$fh>;
chomp $bad_thing;  # drop the trailing newline so it is not part of the pattern
/$bad_thing/;

and either will work without the utf8 pragma if your data is properly decoded, but if you want to do /ل/ then you need use utf8. Make sense?
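The code-point escape also carries over to PHP, the language used in the question, as long as the u modifier is present. This is a small sketch under that assumption; the function name contains_lam is made up for illustration. It keeps the source file pure ASCII, so the encoding the file is saved in no longer matters for the pattern itself.

```php
<?php
// Hypothetical helper: reports whether $text contains U+0644 (ARABIC LETTER LAM).
function contains_lam(string $text): bool
{
    // Single quotes pass \x{644} through to PCRE untouched;
    // PCRE only accepts code points above 0xFF when the u modifier is set.
    return preg_match('~\x{644}~u', $text) === 1;
}

// "\u{...}" in a double-quoted string (PHP 7+) produces the UTF-8 bytes
// for that code point, so the subject can be written in ASCII as well.
var_dump(contains_lam("\u{0644}"));  // bool(true)
var_dump(contains_lam('no lam'));    // bool(false)
```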

