How can I get the file extensions from relative links in HTML text using Perl?
For example, scanning the contents of an HTML page with a Perl regular expression, I want to match all file extensions but not TLDs in domain names. To do this I am making the assumption that all file extensions must be within double quotes.
I came up with the following, and it is working; however, I am failing to figure out a way to exclude the TLDs in the domains. It will return "com", "net", etc.
m/"[^<>]+\.([0-9A-Za-z]*)"/g
Is it possible to negate the match if there is more than one period between the quotes that is separated by text? (i.e. match foo.bar.com but not ./ or ../)
Edit: I am using $1 to return the value captured by the parentheses.
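For reference, here is a self-contained demonstration of the problem (the sample markup is invented for illustration); the TLD comes through as a match:

#!/usr/bin/perl
use strict;
use warnings;

# Two links: one relative file name, one bare domain.
my $html = '<a href="../test.png">a</a> <a href="http://example.com">b</a>';
while ( $html =~ m/"[^<>]+\.([0-9A-Za-z]*)"/g ) {
    print "$1\n";   # prints "png", then the unwanted "com"
}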
#!/usr/bin/perl
use strict;
use warnings;

use File::Basename;
use HTML::TokeParser::Simple;
use URI;

# Walk the <a> tags with a real HTML parser instead of a regex, and let
# File::Basename peel the extension off the path part of each URI.
my $parser = HTML::TokeParser::Simple->new( \*DATA );
while ( my $tag = $parser->get_tag('a') ) {
    my $uri = URI->new( $tag->get_attr('href') );

    # fileparse returns (name, dirs, suffix); [2] is the suffix,
    # including the leading dot, e.g. ".png".
    my $ext = ( fileparse $uri->path, qr/\.\w+\z/ )[2];
    print "$ext\n";
}
__DATA__
<p><a href="../test.png">link</a> <a
href="http://www.example.com/test.jpg">link on example.com</a>
</p>
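For the two links in the DATA section this prints ".png" and ".jpg" (fileparse hands back the suffix with its leading dot), so strip the first character if you want the bare extension.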
First of all, extract the names with an HTML parser of your choice. You should then have something like an array containing the names, as if produced like this:
my @names = ("http://foo.bar.net/quux",
"boink.bak",
"mms://three.two.one"
"hello.jpeg");
The only way to distinguish domain names from file extensions seems to be that in file names there is at least one more slash between the :// part and the extension. Also, a file extension can only be the very last thing in the string.
So, your regular expression would be something like this (untested beyond the sample above; note the [^:]* instead of .*, which stops a path-less URL such as mms://three.two.one from matching):
^(?:(?:\w+://)?(?:\w+\.)+\w+/)?[^:]*\.(\w+)$
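A quick harness to check it against the @names above (a sketch, not a full link extractor):

#!/usr/bin/perl
use strict;
use warnings;

my @names = ("http://foo.bar.net/quux", "boink.bak",
             "mms://three.two.one", "hello.jpeg");
for my $name (@names) {
    if ( $name =~ m{^(?:(?:\w+://)?(?:\w+\.)+\w+/)?[^:]*\.(\w+)$} ) {
        print "$name -> $1\n";   # boink.bak -> bak, hello.jpeg -> jpeg
    }
}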
#!/usr/bin/perl
use strict;
use warnings;

while (<>) {
    # Positive lookbehind: only quoted values that follow one of the
    # link attributes (href=, src=, rel=) are considered.
    while (m/(?<=(?:ref=|src=|rel=))"([^<>"]+?\.([0-9A-Za-z]+?))"/g) {
        # Anything containing "://" is a full URL, so its last dot is
        # (most likely) a TLD rather than a file extension -- skip it.
        if ($1 !~ m{://}) {
            print "$2\n";
        }
    }
}
Used a positive lookbehind to get only the stuff between double quotes behind one of the 'link' attributes (href=, src=, rel=). Fixed to look for "://" when recognizing URLs, and to allow files with absolute paths.
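For example (invocation invented for illustration; assume the script is saved as links.pl):

echo '<a href="../pics/photo.png">x</a> <a href="http://a.com/b.gif">y</a>' | perl links.pl
png

Note that the extension on the absolute URL is deliberately dropped along with the TLD, since the filter skips anything containing "://".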
@Structure: There is no proper way to protect against someone leaving off the protocol part, as it would just turn into a legitimate path name: http://www.noo.com/afile.cfg -> www.noo.com/afile.cfg. You would need to wget (or something) all of the links to make sure they are actually there. And that's an entirely different question...
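If you do want to go down that road, a minimal sketch (URL taken from the example above; LWP::UserAgent assumed to be installed) would be:

use LWP::UserAgent;

# Send a HEAD request so we only check existence, not download the body.
my $ua  = LWP::UserAgent->new( timeout => 10 );
my $res = $ua->head('http://www.noo.com/afile.cfg');
print $res->is_success ? "exists\n" : "missing\n";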
Yes, I know I should use a proper parser, but am just not feeling like it right now :P