How exactly does the "parent" function from HTML::TreeBuilder work?
The documentation on CPAN doesn't really explain this behavior unless I'm missing something. I've put together some quick test code to illustrate my problem:
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;
my $testHtml = "
<body>
<h1>
<p>
<p>HELLO!
</p>
</p>
</h1>
</body>";
my $parsedPage = HTML::TreeBuilder->new;
$parsedPage->parse($testHtml);
$parsedPage->eof();
my @p = $parsedPage->look_down('_tag' => 'p');
foreach (@p) {print $_->parent->tag, " : ", $_->tag, "\t", $_->as_text, "\n";}
After running the above script, the output is:
body : p
body : p HELLO!
Seeing as all the tags are nested one after another, I would think that the parent of the first p tag would be h1, and the parent of the second p tag would be p. Why is the parent function showing the body tag for both?
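Your HTML is invalid: a p element cannot contain another p, and an h1 cannot contain a p. Given that HTML::TreeBuilder is a subclass of HTML::Parser, I can only assume that the parser is doing what it can to transform your document into valid HTML. Here that seems to mean closing the h1 and the first p before the next element opens, so both p elements end up as children of body.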
You can call $parsedPage->as_HTML to see what the parser has done to your HTML. It gives me this:
<html><head></head><body><h1></h1><p><p>HELLO! </body></html>
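If the flat as_HTML string is hard to read, a rough sketch like the one below may help. It reuses the markup from the question, prints the repaired tree with dump, and then lists each p element's ancestors with lineage_tag_names (both are HTML::Element methods). The variable names are mine, and the exact tree you see may depend on your HTML::TreeBuilder version.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;

# Same invalid nesting as in the question: <p> inside <h1>, <p> inside <p>.
my $html = "
<body>
<h1>
<p>
<p>HELLO!
</p>
</p>
</h1>
</body>";

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof();

# Print the tree as the parser actually built it, indented by depth.
$tree->dump;

# For each <p>, print its own tag followed by its ancestors,
# starting with the immediate parent and ending at the root.
foreach my $p ($tree->look_down('_tag' => 'p')) {
    print join(' < ', $p->tag, $p->lineage_tag_names), "\n";
}
```

The first name after each p in the second listing is what parent returns, so this is a quick way to check where the parser really put an element.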
Perhaps you should pass your HTML through a validator or HTML::Tidy before processing it.
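For comparison, here is a minimal sketch with markup that nests legally. The div wrapper and the second paragraph are my own made-up example, not something from your page. Since a div may contain paragraphs, the parser should have nothing to repair, and parent should report div for each p.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;

# Hypothetical valid markup: two paragraphs inside a <div>.
my $validHtml = "<body><div><p>HELLO!</p><p>GOODBYE!</p></div></body>";

my $tree = HTML::TreeBuilder->new;
$tree->parse($validHtml);
$tree->eof();

# Same report as in the question: parent tag, own tag, text content.
foreach my $p ($tree->look_down('_tag' => 'p')) {
    print $p->parent->tag, " : ", $p->tag, "\t", $p->as_text, "\n";
}
```

So parent itself is behaving normally. It reports the parent in the tree the parser built, which is not necessarily the nesting you typed.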