What do I gain by filtering URLs through Perl's URI module?
Do I gain something when I transform my $url like this: $url = URI->new( $url )?
#!/usr/bin/env perl
use warnings; use strict;
use 5.012;
use URI;
use XML::LibXML;
my $url = 'http://stackoverflow.com/';
$url = URI->new( $url );
my $doc = XML::LibXML->load_html( location => $url, recover => 2 );
my @nodes = $doc->getElementsByTagName( 'a' );
say scalar @nodes;
The URI module constructor cleans up the URI for you, for example by correctly escaping characters that are not valid in a URI (see URI::Escape).
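For instance, a small sketch (the search URL with a space in it is just a made-up example):
use warnings; use strict;
use 5.012;
use URI;
use URI::Escape qw(uri_escape);
# The constructor percent-encodes characters that are not legal in a URI,
# such as the space in this query string.
my $url = URI->new('http://stackoverflow.com/search?q=perl uri');
say $url;                      # http://stackoverflow.com/search?q=perl%20uri
# URI::Escape applies the same kind of encoding to a plain string.
say uri_escape('perl uri');    # perl%20uri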
The URI module has several benefits (a short sketch follows the list):
- It normalizes the URL for you
- It can resolve relative URLs
- It can detect invalid URLs (although you need to turn off the schemeless bits)
- You can easily filter the URLs that you want to process.
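A minimal sketch of those points; the hostnames and paths are illustrative, not from the original answer:
use warnings; use strict;
use 5.012;
use URI;
# Normalization: canonical() lowercases the scheme and host and drops the default port.
my $uri = URI->new('HTTP://StackOverflow.COM:80/questions/');
say $uri->canonical;            # http://stackoverflow.com/questions/
# Resolving a relative URL against a base.
my $base = URI->new('http://stackoverflow.com/questions/12345');
say URI->new_abs('../users/67890', $base);   # http://stackoverflow.com/users/67890
# A schemeless string is still accepted; scheme() returning undef is one way to spot it.
my $maybe = URI->new('stackoverflow.com/questions');
say defined $maybe->scheme ? 'has a scheme' : 'no scheme';   # no scheme
# Filtering: keep only http(s) URLs on one host.
my @urls   = map { URI->new($_) } 'http://stackoverflow.com/a',
                                  'https://example.org/b',
                                  'mailto:someone@example.org';
my @wanted = grep { $_->scheme =~ /\Ahttps?\z/ && $_->host eq 'stackoverflow.com' } @urls;
say for @wanted;                # http://stackoverflow.com/a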
The benefit that you get with the little bit of code that you show is minimal, but as you continue to work on the problem, perhaps spidering the site, URI becomes more handy as you select what to do next.
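For example, continuing from the question's code, that selection step might look roughly like this; the base URL and the keep-to-the-same-host rule are assumptions made for illustration:
use warnings; use strict;
use 5.012;
use URI;
use XML::LibXML;
# Resolve every <a href> against the page URL and keep only the http(s)
# links that stay on the same host.
my $base = URI->new('http://stackoverflow.com/');
my $doc  = XML::LibXML->load_html( location => $base, recover => 2 );
my @next;
for my $anchor ( $doc->getElementsByTagName('a') ) {
    my $href = $anchor->getAttribute('href') or next;
    my $uri  = URI->new_abs( $href, $base )->canonical;   # resolve and normalize
    next unless $uri->scheme && $uri->scheme =~ /\Ahttps?\z/;
    next unless $uri->host eq $base->host;                # stay on this site
    push @next, $uri;
}
say for @next;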
I'm surprised nobody has mentioned it yet, but $url = URI->new( $url ); doesn't clean up your $url and hand it back to you; it creates a new object of class URI (or, rather, of one of its subclasses) which can then be passed to other code that requires a URI object. That's not particularly important in this case, since XML::LibXML appears to be happy to accept locations as either strings or objects, but some other modules require you to give them a URI object and will reject URLs presented as plain strings.
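To illustrate the distinction, a short sketch:
use warnings; use strict;
use 5.012;
use URI;
use Scalar::Util qw(blessed);
my $url = URI->new('http://stackoverflow.com/');
say blessed($url);    # URI::http (an object, not a cleaned-up string)
say $url->host;       # stackoverflow.com (methods a plain string doesn't have)
say "$url";           # http://stackoverflow.com/ (overloaded stringification)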