Perl HTML parsing lib/tool

Are there any powerful tools/libs for Perl, like BeautifulSoup for Python?

Thanks


HTML::TreeBuilder::XPath is a decent solution for most problems.
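
A minimal sketch of what that looks like, assuming HTML::TreeBuilder::XPath is installed from CPAN. The sample markup and variable names are made up for illustration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample document; normally this comes from a file or an HTTP response.
my $html = '<html><head><title>Demo</title></head>'
         . '<body><a href="/a">First</a><p>text <a href="/b">Second</a></p></body></html>';

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# findvalue returns the string value of whatever the expression matches.
my $title = $tree->findvalue('//title');

# findnodes returns the matching HTML::Element objects in list context.
my @hrefs = map { $_->attr('href') } $tree->findnodes('//a[@href]');

print "Title: $title\n";
print "Link: $_\n" for @hrefs;

$tree->delete;    # release the tree explicitly
```

The nodes that come back are ordinary HTML::Element objects, so the usual methods such as attr and as_text work on them.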


I have never used BeautifulSoup, but from a quick skim of its documentation, you might want HTML::TreeBuilder. It handles even broken documents well and lets you traverse the parsed tree or query for elements; see the look_down method in HTML::Element.
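
A rough sketch of a look_down query, assuming HTML::TreeBuilder is installed. The sample markup is invented for the example:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample document, used only for illustration.
my $html = '<html><head><title>Demo</title></head>'
         . '<body><a href="/a">First</a><p>text <a href="/b">Second</a></p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down takes attribute/value criteria; _tag matches the element name,
# and a qr// value is matched against the attribute as a regex.
my @anchors = $tree->look_down( _tag => 'a', href => qr{^/} );

# Collect [href, link text] pairs from the matched elements.
my @links = map { [ $_->attr('href'), $_->as_text ] } @anchors;

printf "%s -> %s\n", $_->[1], $_->[0] for @links;

$tree->delete;    # release the tree explicitly
```

In scalar context look_down returns only the first match, which is handy when you expect a single element.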

If you like or know XPath, see daxim's recommendation (HTML::TreeBuilder::XPath). If you prefer to pick items via CSS selectors, have a look at Web::Scraper or Mojo::DOM.
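
For the CSS selector route, here is a small sketch with Mojo::DOM, assuming the Mojolicious distribution is installed. The sample markup and variable names are made up:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample document, used only for illustration.
my $html = '<html><head><title>Demo</title></head>'
         . '<body><a href="/a">First</a><p>text <a href="/b">Second</a></p></body></html>';

my $dom = Mojo::DOM->new($html);

# at() returns the first element matching the selector.
my $title = $dom->at('title')->text;

# find() returns a Mojo::Collection; each() without a callback yields a plain list.
my @hrefs = map { $_->attr('href') } $dom->find('a[href]')->each;

print "Title: $title\n";
print "Link: $_\n" for @hrefs;
```

Note that at() returns undef when nothing matches, so real code should check the result before calling text on it.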


As you're looking for power, you can use XML::LibXML to parse HTML. The advantage is that you then have the full power of the fastest and best XML toolchain available to Perl (except MSXML, which is MS-only) for processing your document, including XPath and XSLT (which would require a re-parse if you used a parser other than XML::LibXML).

use strict;
use warnings;
use XML::LibXML;
# In 1.70, the recover and suppress_warnings options won't shut up the
# warnings. Hence, a workaround is needed to keep the messages away from
# the screen.
sub shutup_stderr {
    my( $subref, $bufref ) = @_;
    open my $fhbuf, '>', $bufref or die "Cannot open in-memory filehandle: $!";
    local *STDERR = $fhbuf;
    $subref->(); # execute code that needs to be shut up
    return;
}
# ==== main ============================================================
my $url = shift || 'http://www.google.de';
my $parser = XML::LibXML->new( recover => 2 ); # suppress_warnings => 1
# Note that "recover" and "suppress_warnings" might not work - see above.
# https://rt.cpan.org/Public/Bug/Display.html?id=58024
my $dom; # receive document
shutup_stderr
    sub { $dom = $parser->load_html( location => $url ) }, # code
    \my $errmsg; # buffer
# Now process document as XML.
my @nodes = $dom->getElementsByLocalName( 'title' );
printf "Document title: %s\n", $_->textContent for @nodes;
printf "Length of error messages: %u\n", length $errmsg;
print '-' x 72, "\n";
print $dom->toString( 1 );