What do I gain by filtering URLs through Perl's URI module?

Do I gain something when I transform my $url like this: $url = URI->new( $url )?

#!/usr/bin/env perl
use warnings; use strict;
use 5.012;
use URI;
use XML::LibXML;

my $url = 'http://stackoverflow.com/';
$url = URI->new( $url );

my $doc = XML::LibXML->load_html( location => $url, recover => 2 );
my @nodes = $doc->getElementsByTagName( 'a' );
say scalar @nodes;


The URI module's constructor cleans up the URI for you; for example, it correctly escapes characters that are not valid in a URI (see URI::Escape).
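
As a rough illustration of that escaping, here is a minimal sketch. The example.com address is a made-up placeholder, not something from the question; it only shows how a string containing spaces looks after it has gone through the constructor.

```perl
#!/usr/bin/env perl
use warnings; use strict;
use 5.012;
use URI;

# Hypothetical input with spaces, which are not allowed in a URI.
my $raw = 'http://example.com/some path/?q=a b';

# The constructor escapes the characters that are not valid URI characters.
my $escaped = URI->new( $raw );

say "before: $raw";
say "after:  $escaped";
```

Characters that are already legal, such as `/`, `?` and `=`, are left alone, so the structure of the URL is preserved.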


The URI module has several benefits:

  • It normalizes the URL for you
  • It can resolve relative URLs
  • It can detect invalid URLs (although you need to turn off the schemeless bits)
  • You can easily filter the URLs that you want to process.

The benefit that you get with the little bit of code that you show is minimal, but as you continue to work on the problem, perhaps spidering the site, URI becomes more handy as you select what to do next.
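
A minimal sketch of how the benefits listed above might look in a spider. The base address and the list of links are invented for illustration, and keeping only same-host http/https links is just one possible filtering rule, not something the question prescribes.

```perl
#!/usr/bin/env perl
use warnings; use strict;
use 5.012;
use URI;

# Hypothetical page address and hrefs found on it (illustration only).
my $base  = URI->new( 'http://stackoverflow.com/questions/' );
my @hrefs = (
    '/tags',
    '../users',
    'HTTP://StackOverflow.COM:80/help',
    'mailto:someone@example.com',
    'http://example.com/',
);

# Normalization: canonical() lower-cases scheme and host and drops the default port.
my $canonical = URI->new( 'HTTP://StackOverflow.COM:80/help' )->canonical;
say "canonical: $canonical";

# Relative resolution: new_abs() resolves each href against the base URL.
my @absolute = map { URI->new_abs( $_, $base )->canonical } @hrefs;

# Filtering: keep only http/https links on the same host as the base.
# The scheme is checked first because non-server schemes such as mailto have no host().
my @same_site = grep {
    ( $_->scheme // '' ) =~ /^https?$/ && $_->host eq $base->host
} @absolute;

say for @same_site;
```

Working with objects rather than strings means the filter can ask for the scheme and host directly instead of matching the URL with a hand-written regular expression.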


I'm surprised nobody has mentioned it yet, but $url = URI->new( $url ); doesn't clean up your $url and hand it back to you; it creates a new object of class URI (or, rather, of one of its subclasses), which can then be passed to other code that requires a URI object. That's not particularly important in this case, since XML::LibXML appears to be happy to accept locations as either strings or objects, but some other modules require you to give them a URI object and will reject URLs presented as plain strings.
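
To make the string-versus-object distinction concrete, here is a small sketch using the URL from the question. The variable name $uri is introduced only for this example so that the original string stays available for comparison.

```perl
#!/usr/bin/env perl
use warnings; use strict;
use 5.012;
use URI;

my $url = 'http://stackoverflow.com/';
my $uri = URI->new( $url );    # an object, not a cleaned-up string

# ref() reports the scheme-specific subclass the constructor picked.
say 'class: ', ref $uri;
say 'isa URI: ', ( $uri->isa( 'URI' ) ? 'yes' : 'no' );

# The object stringifies when interpolated, so code expecting a string often still works.
say "as string: $uri";

# Accessor methods are only available on the object.
say 'host: ', $uri->host;
```

That overloaded stringification is one likely reason a module that expects a plain string can accept the object without complaint.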
