How can I get a string of unknown length from a web page?
I need to extract a string in Perl whose length varies from day to day. Look at the page content below:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<title>Index of /isos/preFCS5.3/LATESTGOODCVP</title>
</head>
<body>
<h1>Index of /isos/preFCS5.3/LATESTGOODCVP</h1>
<table><tr><th><img src="/icons/blank.gif" alt="[ICO]"></th><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr><tr><th colspan="5"><hr></th></tr>
<tr><td valign="top"><img src="/icons/back.gif" alt="[DIR]"></td><td><a href="/isos/preFCS5.3/">Parent Directory</a></td><td> </td><td align="right"> - </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="CVP-LATEST-5.3.0.37.iso">CVP-LATEST-5.3.0.37.iso</a></td><td align="right">19-Jul-2011 03:32 </td><td align="right">816M</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="ChangeLog-LATEST.2011-07-19-03h.30m.01s">ChangeLog-LATEST.2011-07-19-03h.30m.01s</a></td><td align="right">19-Jul-2011 03:32 </td><td align="right"> 16K</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="is.iso">is.iso</a></td><td align="right">19-Jul-2011 03:32 </td><td align="right">816M</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="md5SUM">md5SUM</a></td><td align="right">19-Jul-2011 03:32 </td><td align="right">111 </td></tr>
<tr><th colspan="5"><hr></th></tr>
</table>
<address>Apache/2.2.3 (Red Hat) Server at www.google.com Port 80</address>
</body></html>
You can see a string named "CVP-LATEST-5.3.0.37.iso". I need to get that into $name. The string CVP-LATEST-5.3.0.37.iso will change every day, for example to CVP-LATEST-5.3.0.39.iso, CVP-LATEST-5.3.39a.iso, CVP-LATEST-6.1.iso, or CVP-LATEST-6.23.23.112.iso.
Is there any way I can get this?
Here is the code:
use strict;
use warnings;
use LWP::Simple;
my $oldVersion = 'CVP-LATEST-5.3.0.37.iso';
my $url = 'http://www.google.com/isos/preFCS5.3/LATESTGOODCVP/';
my $newPage = get($url)
or die "Cannot retrieve contents from $url\n";
if ( $newPage =~ /href=\"CVP-LATEST-5\.3\.0\.(\d\d)/ ) {
my $version = $1;
if ( $version != $oldVersion ) {
my $status = getstore($url . "CVP-LATEST-5.3.0.$version.iso",
"CVP-LATEST-5.3.0.$version.iso");
} else {
print "Already at most recent version\n";
}
} else {
die "Cannot find version tag in contents from $url\n";
}
As you can see, the code captures only the number (XX) after "5.3.0.", which has a known length of 2.
Is there any way I can change it so that it reads the whole filename, i.e. CVP-LATEST-XXXXXX*.iso, and then compares it with $oldVersion?
Please note that the strings "CVP-LATEST-" and ".iso" remain constant, but the characters in between change and can also contain letters. Also note that there is another file called is.iso in the page content. I don't want to get that one.
You should use a module that knows how to parse HTML when you want to parse HTML.
Your Question is Asked Frequently:
perldoc -q url
How do I extract URLs?
use HTML::SimpleLinkExtor;
...
my $extor = HTML::SimpleLinkExtor->new();
$extor->parse($newPage);
my($version) = grep /^CVP-LATEST-.*\.iso/, $extor->href;
Try
if ( $newPage =~ /href=\"CVP-LATEST-(.*?)\.iso\"/ ) {
my $name = "CVP-LATEST-${1}.iso";
After this, $name contains the whole filename.
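A fuller sketch of that idea, using only core Perl. It assumes the page has already been fetched (a shortened sample of the listing from the question stands in for the result of get($url)), and that $oldName holds the filename from the previous run; both the sample and $oldName are illustrative. Because the whole filename is a string, the comparison uses ne rather than the numeric !=.

```perl
use strict;
use warnings;

# Shortened sample of the directory listing from the question; in the
# real script this would be the return value of LWP::Simple's get($url).
my $newPage = <<'HTML';
<td><a href="CVP-LATEST-5.3.0.37.iso">CVP-LATEST-5.3.0.37.iso</a></td>
<td><a href="is.iso">is.iso</a></td>
HTML

# Hypothetical value: the filename downloaded on the previous run.
my $oldName = 'CVP-LATEST-5.3.0.36.iso';

my $name;
if ( $newPage =~ /href="(CVP-LATEST-[^"]+\.iso)"/ ) {
    # Capture the whole filename, whatever its length.
    $name = $1;
}
else {
    die "Cannot find an ISO link in the page\n";
}

# Filenames are strings, so compare with ne/eq, not != / ==.
if ( $name ne $oldName ) {
    print "New version available: $name\n";
}
else {
    print "Already at most recent version\n";
}
```

The required "CVP-LATEST-" prefix is what keeps is.iso from matching.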
The secret to HTML regexes: match "anything that is not a double quote" instead of "anything".
/href="([^"]*)"/i
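As a rough illustration of how that pattern could be applied here, the sketch below collects every href value from a sample of the listing and then keeps only the names of interest. The sample HTML and the variable names are assumptions made for the example, not part of the original script.

```perl
use strict;
use warnings;

# Sample links modelled on the listing in the question.
my $html = <<'HTML';
<a href="/isos/preFCS5.3/">Parent Directory</a>
<a href="CVP-LATEST-6.23.23.112.iso">CVP-LATEST-6.23.23.112.iso</a>
<a href="is.iso">is.iso</a>
<a href="md5SUM">md5SUM</a>
HTML

# In list context with /g, the match returns every captured group,
# i.e. the value of each href attribute.
my @hrefs = $html =~ /href="([^"]*)"/gi;

# Keep only names that start with CVP-LATEST- and end with .iso.
my @isos = grep { /^CVP-LATEST-.+\.iso$/ } @hrefs;

print "$_\n" for @isos;
```

Anchoring the filter with ^ and $ excludes is.iso and any link that merely contains the prefix somewhere in the middle.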