How to detect Arabic characters using a Perl regex?

I'm parsing some HTML pages and need to detect any Arabic characters inside them. I've tried various regexes, but no luck.

Does anyone know a working way to do that?

Thanks


Here is the page I'm processing: http://pastie.org/2509936

And my code is:

#!/usr/bin/perl 
use LWP::UserAgent; 
@MyAgent::ISA = qw(LWP::UserAgent); 

# set inheritance 
$ua = LWP::UserAgent->new; 
$q = 'http://pastie.org/2509936';
$request = HTTP::Request->new('GET', $q); 
$response = $ua->request($request); 
if ($response->is_success) { 
    if ($response->content=~/[\p{Script=Arabic}]/g) {
        print "found arabic"; 
    } else { 
        print "not found"; 
    } 
}


If you're using Perl, you should be able to match on the Unicode script property: /\p{Arabic}/

If that doesn't work, you'll have to look up the range of Unicode code points for Arabic and test for them with a character class, something like /[\x{0600}-\x{06FF}]/.
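
To illustrate how the two patterns above differ, here is a minimal sketch. It assumes the input is already a decoded character string, not raw bytes; the helper names and sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers; both assume $text holds decoded characters.
sub has_arabic_script {
    my ($text) = @_;
    return $text =~ /\p{Arabic}/ ? 1 : 0;
}

sub has_arabic_block {
    my ($text) = @_;
    return $text =~ /[\x{0600}-\x{06FF}]/ ? 1 : 0;
}

# Sample strings written with escapes, so no "use utf8" is needed.
my %samples = (
    'meem (U+0645)'               => "abc \x{0645}",
    'alef isolated form (U+FE8D)' => "abc \x{FE8D}",
    'plain ASCII'                 => "abc",
);

for my $label (sort keys %samples) {
    printf "%-30s script=%d block=%d\n",
        $label,
        has_arabic_script($samples{$label}),
        has_arabic_block($samples{$label});
}
```

The explicit range only covers the basic Arabic block (U+0600 to U+06FF), so a presentation form such as U+FE8D is missed by it, while the script property still matches.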


EDIT (as I have obviously wandered into tchrist's area of expertise). Skip using $response->content, which always returns a raw byte string, and use $response->decoded_content, which applies any decoding hints it gets from the response headers.
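
As a rough offline illustration of that difference, the sketch below builds an HTTP::Response by hand instead of fetching anything. The body bytes and the has_arabic helper are assumptions made up for the example; only content and decoded_content come from the advice above.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical stand-in for a fetched page: UTF-8 bytes of an Arabic word.
my $bytes = "<p>\xD9\x85\xD8\xB1\xD8\xAD\xD8\xA8\xD8\xA7</p>";

# Hand-built response carrying the charset hint in its Content-Type header.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=utf-8' ],
    $bytes,
);

# Hypothetical helper: true if the string contains an Arabic-script character.
sub has_arabic {
    my ($text) = @_;
    return $text =~ /\p{Arabic}/ ? 1 : 0;
}

my $raw_hit     = has_arabic($response->content);          # raw bytes
my $decoded_hit = has_arabic($response->decoded_content);  # characters

print "content:         ", ($raw_hit     ? "found arabic" : "not found"), "\n";
print "decoded_content: ", ($decoded_hit ? "found arabic" : "not found"), "\n";
```

In the original script, the same change would amount to testing $response->decoded_content in the if condition.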


The page you are downloading is UTF-8 encoded, but you are not reading it as UTF-8 (in fairness, there are no hints on the page about what the encoding is [update: the server does return the header Content-Type: text/html; charset=utf-8, though]).

You can see this if you examine $response->content:

use List::Util 'max';
my $max_ord = max map{ord}split //, $response->content;
print "max ord of response content is $max_ord\n";

If you get a value less than 256, then you are reading this content in as raw bytes, and your strings will never match /\p{Arabic}/. You must decode the input as UTF-8 before you apply the regex:

use Encode;
my $content = decode('utf-8', $response->content);
# now check  $content =~ /\p{Arabic}/
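
Putting the two snippets above together, here is a self-contained sketch that needs no network access. The sample bytes and the max_ord helper are assumptions for illustration; it uses only the core Encode and List::Util modules.

```perl
use strict;
use warnings;
use Encode qw(decode);
use List::Util qw(max);

# Hypothetical sample: UTF-8 bytes of an Arabic word wrapped in a tag.
my $bytes = "<p>\xD9\x85\xD8\xB1\xD8\xAD\xD8\xA8\xD8\xA7</p>";

# Largest code point in a string; below 256 suggests undecoded bytes.
sub max_ord {
    my ($string) = @_;
    return max map { ord } split //, $string;
}

# Turn the UTF-8 bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

printf "raw:     max ord %d, arabic: %s\n",
    max_ord($bytes), ($bytes =~ /\p{Arabic}/ ? 'yes' : 'no');
printf "decoded: max ord %d, arabic: %s\n",
    max_ord($chars), ($chars =~ /\p{Arabic}/ ? 'yes' : 'no');
```

Before decoding, every value is a single byte, so the regex has nothing Arabic to match; after decoding, the same data contains code points in the Arabic range.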

Sometimes (and now I am wading well outside my area of expertise) the page you are loading contains hints about how it is encoded, and $response->content may already be decoded correctly. In that case, the decode call above is unnecessary and may be harmful. See other SO posts on detecting the encoding of an arbitrary string.


Just for the record, at least in .NET regexps, you need to use \p{IsArabic}.
