Why doesn't the match operator match anything?

I'm trying to parse this HTML block:

<div class="v120WrapperInner"><a href="/redirect?q=http%3A%2F%2Fwww.google.com%2Faclk%3Fsa%3DL%26ai%3DCKJh--O7tSsCVIKeyoQTwiYmRA5SnrIsB1szYhg2d2J_EAhABIJ7rxQ4oA1CLk676B2DJntmGyKOQGcgBAaoEFk_Qyu5ipY7edN5ETLuchKUCHbY4SA#0%26num%3D1%26sig%3DAGiWqtwtAf8NslosN7AuHb7qC7RviHVg7A%26q%3Dhttp%3A%2F%2Fwww.youtube.com%2Fwatch%253Fv%253D91sYT_8CN8Q%2526feature%253Dpyv%2526ad%253D3409309746%2526kw%253Dsusan%25252#0boyle&amp;adtype=pyv&amp;event=ad&amp;usg=bR7ErKA_3szWtQMGe2lt1dpxzHc=" title="The Valley Downs Chicago"><img class="vimg120" alt="The Valley Downs Chicago" src="http://i2.ytimg.com/vi/91sYT_8CN8Q/1.jpg">

to capture the redirect link:

/redirect?q=http%3A%2F%2Fwww.google.com%2Faclk%3Fsa%3DL%26ai%3DCKJh--O7tSsCVIKeyoQTwiYmRA5SnrIsB1szYhg2d2J_EAhABIJ7rxQ4oA1CLk676B2DJntmGyKOQGcgBAaoEFk_Qyu5ipY7edN5ETLuchKUCHbY4SA#0%26num%3D1%26sig%3DAGiWqtwtAf8NslosN7AuHb7qC7RviHVg7A%26q%3Dhttp%3A%2F%2Fwww.youtube.com%2Fwatch%253Fv%253D91sYT_8CN8Q%2526feature%253Dpyv%2526ad%253D3409309746%2526kw%253Dsusan%25252#0boyle&amp;adtype=pyv&amp;event=ad&amp;usg=bR7ErKA_3szWtQMGe2lt1dpxzHc=

and video title:

The Valley Downs Chicago

When I use this simple Perl code:

foreach $_ (@promotedVideos)
{
   if (/\s<div class="v120WrapperInner"><a href="([^"]*)" title="([^"]*)"><img/six)
   {
     print $1;
     print $2;
   }
}

nothing prints. While I'm troubleshooting this, I thought I'd ask you the experts if you see anything wrong or problematic. Thanks so much in advance for your help!


The /x modifier is the problem: it makes Perl ignore literal whitespace in the pattern. Remove it.

That is, it should be

if (/\s<div class="v120WrapperInner"><a href="([^"]*)" title="([^"]*)"><img/si)

/x makes Perl ignore whitespace inside the regex, so your pattern is equivalent to the following:

/\s<divclass="v120WrapperInner"><ahref="([^"]*)"title="([^"]*)"><img/si

that will not match.
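A minimal sketch of the effect, using a shortened, hypothetical tag string ($html) rather than the full HTML from the question. It shows that an unescaped space is dropped under /x, while an escaped space or a bracketed one ([ ]) stays significant:

```perl
use strict;
use warnings;

# Hypothetical, shortened sample; not the full HTML from the question.
my $html = '<a href="/redirect?q=x" title="The Valley Downs Chicago">';

# Under /x the unescaped spaces are ignored, so this pattern is really
# <ahref="..."title="..."> and cannot match.
my $with_x = $html =~ /<a href="([^"]*)" title="([^"]*)">/x ? 1 : 0;

# Without /x the spaces are literal and the pattern matches.
my $without_x = $html =~ /<a href="([^"]*)" title="([^"]*)">/ ? 1 : 0;

# To keep /x, escape the space (\ ) or put it in a character class ([ ]).
my ($href, $title) = $html =~ /<a\ href="([^"]*)" [ ] title="([^"]*)">/x;

print "with /x: $with_x\n";        # with /x: 0
print "without /x: $without_x\n";  # without /x: 1
print "$href\n$title\n";
```

If you do want /x for readability, this is the usual way to keep the spaces that matter.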

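Also, the \s at the beginning may break things: it requires a whitespace character before <div, so the match fails when the string starts with <div.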
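A small sketch of the \s issue, assuming two hypothetical inputs ($at_start and $indented) that differ only in leading whitespace. A bare \s needs one whitespace character in front of <div, while \s* makes it optional:

```perl
use strict;
use warnings;

# Hypothetical inputs: the same tag with and without leading whitespace.
my $at_start = '<div class="v120WrapperInner">';
my $indented = "\n  " . $at_start;

# \s demands one whitespace character immediately before <div.
my $strict_start    = $at_start =~ /\s<div class="v120WrapperInner">/ ? 1 : 0;
my $strict_indented = $indented =~ /\s<div class="v120WrapperInner">/ ? 1 : 0;

# \s* accepts zero or more, so both inputs match.
my $loose_start    = $at_start =~ /\s*<div class="v120WrapperInner">/ ? 1 : 0;
my $loose_indented = $indented =~ /\s*<div class="v120WrapperInner">/ ? 1 : 0;

print "strict: $strict_start $strict_indented\n";  # strict: 0 1
print "loose:  $loose_start $loose_indented\n";    # loose:  1 1
```

Since the pattern is not anchored, dropping the \s entirely has the same effect as \s*.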

This is the code I've used for testing:

use strict;


my $inp = '<div class="v120WrapperInner"><a href="/redirect?q=http%3A%2F%2Fwww.google.com%2Faclk%3Fsa%3DL%26ai%3DCKJh--O7tSsCVIKeyoQTwiYmRA5SnrIsB1szYhg2d2J_EAhABIJ7rxQ4oA1CLk676B2DJntmGyKOQGcgBAaoEFk_Qyu5ipY7edN5ETLuchKUCHbY4SA#0%26num%3D1%26sig%3DAGiWqtwtAf8NslosN7AuHb7qC7RviHVg7A%26q%3Dhttp%3A%2F%2Fwww.youtube.com%2Fwatch%253Fv%253D91sYT_8CN8Q%2526feature%253Dpyv%2526ad%253D3409309746%2526kw%253Dsusan%25252#0boyle&amp;adtype=pyv&amp;event=ad&amp;usg=bR7ErKA_3szWtQMGe2lt1dpxzHc=" title="The Valley Downs Chicago"><img class="vimg120" alt="The Valley Downs Chicago" src="http://i2.ytimg.com/vi/91sYT_8CN8Q/1.jpg">';

print "$inp\n";

if ( $inp =~ /<div class="v120WrapperInner"><a href="([^"]*)" title="([^"]*)"><img/si )
{
 print "m:\n$1\n$2\n";
}


Okay, this is not exactly what you are asking, but I think (based on this and your older question) that you are parsing HTML.

Let me tell you this: regexes aren't the solution. You should use HTML::TreeBuilder to parse HTML documents, because HTML documents are horribly messy.

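#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Parse the HTML stored after __DATA__ at the end of this script.
my $root = HTML::TreeBuilder->new_from_file(\*DATA);
foreach my $div ($root->find_by_tag_name('div')) {
    # Not every div has a class attribute; default to '' to avoid warnings.
    next unless ($div->attr('class') // '') eq 'v120WrapperInner';
    # In list context this returns every <a> element under the div.
    foreach my $link ($div->find_by_tag_name('a')) {
        print "m:\n", $link->attr('href'), "\n", $link->attr('title'), "\n";
    }
}
$root->delete;    # release the parse tree

__DATA__
<div class="v120WrapperInner"><a href="/redirect?q=http%3A%2F%2Fwww.google.com%2Faclk%3Fsa%3DL%26ai%3DCKJh--O7tSsCVIKeyoQTwiYmRA5SnrIsB1szYhg2d2J_EAhABIJ7rxQ4oA1CLk676B2DJntmGyKOQGcgBAaoEFk_Qyu5ipY7edN5ETLuchKUCHbY4SA#0%26num%3D1%26sig%3DAGiWqtwtAf8NslosN7AuHb7qC7RviHVg7A%26q%3Dhttp%3A%2F%2Fwww.youtube.com%2Fwatch%253Fv%253D91sYT_8CN8Q%2526feature%253Dpyv%2526ad%253D3409309746%2526kw%253Dsusan%25252#0boyle&amp;adtype=pyv&amp;event=ad&amp;usg=bR7ErKA_3szWtQMGe2lt1dpxzHc=" title="The Valley Downs Chicago"><img class="vimg120" alt="The Valley Downs Chicago" src="http://i2.ytimg.com/vi/91sYT_8CN8Q/1.jpg">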


It is good that you are gaining experience with regexes in Perl, but for this type of work you might consider using a DOM parser like XML::DOM.


G'day,

If you're having problems understanding regexps, can I suggest reading the regexp intro in Dale Dougherty's excellent book "sed & awk"?

Definitely one of the best intros to regexps around.

HTH

cheers,
