How do I extract information from a webpage using perl?

Posted 2023-04-01 19:16

I need to extract the largest value (number) for specific names from a webpage. Consider the webpage at

 http://earth.wifi.com/isos/preFCS5.3/upgrade/

and the content at that URL is

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of /isos/preFCS5.3/upgrade</title>
 </head>
 <body>
<h1>Index of /isos/preFCS5.3/upgrade</h1>
<table><tr><th><img src="/icons/blank.gif" alt="[ICO]"></th><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr><tr><th colspan="5"><hr></th></tr>
<tr><td valign="top"><img src="/icons/back.gif" alt="[DIR]"></td><td><a href="/isos/preFCS5.3/">Parent Directory</a></td><td>&nbsp;</td><td align="right">  - </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="GTP-UPG-LATEST-5.3.0.160.iso">GTP-UPG-LATEST-5.3.0.160.iso</a></td><td align="right">29-Aug-2011 16:06  </td><td align="right">804M</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="GTP-UPG-LATEST-5.3.0.169.iso">GTP-UPG-LATEST-5.3.0.169.iso</a></td><td align="right">31-Aug-2011 16:26  </td><td align="right">804M</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="GTP-UPG-LATEST-5.3.0.172.iso">GTP-UPG-LATEST-5.3.0.172.iso</a></td><td align="right">01-Sep-2011 16:26  </td><td align="right">804M</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="PRE-UPG-LATEST-5.3.0.157.iso">PRE-UPG-LATEST-5.3.0.157.iso</a></td><td align="right">29-Aug-2011 16:05  </td><td align="right">1.5G</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="PRE-UPG-LATEST-5.3.0.165.iso">PRE-UPG-LATEST-5.3.0.165.iso</a></td><td align="right">31-Aug-2011 16:26  </td><td align="right">1.5G</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="PRE-UPG-LATEST-5.3.0.168.iso">PRE-UPG-LATEST-5.3.0.168.iso</a></td><td align="right">01-Sep-2011 16:26  </td><td align="right">1.5G</td></tr>
<tr><th colspan="5"><hr></th></tr>
</table>
<address>Apache/2.2.3 (Red Hat) Server at earth.wifi.com Port 80</address>
</body></html>

In the above source you can see that 172 is the largest for GTP-UPG-LATEST-5.3.0 and 168 is the largest for PRE-UPG-LATEST-5.3.0.

How can I extract these values and put them into variables, say $gtp and $pre, in Perl?

Thanks so much in advance

#!/usr/bin/perl

use strict;
use warnings;

use LWP::Simple;

my $upgrade = 'http://earth.wifi.com/isos/preFCS5.3/upgrade/';
my $website_content = get($upgrade);
if ( $website_content =~ /href=\"PRE-UPG-LATEST-5.3.0(.*?)\.iso\"/ ) {
    my $preversion = $1;
    print $preversion;
}

This is the code I tried, but it's not getting the largest value. It gets the first PRE-UPG-LATEST version value that it encounters, but I need the largest one.

An if() executes only once. Since you want to get many matches, you need a loop:

while ( m//g ) {

In your data it has "UPG" but your regex has "UGP", so it won't match (you should copy/paste long strings rather than (attempt to) retype them!).

This will list the data you need; I'll leave it to you to figure out how to process it.

while ($website_content =~ /href="((?:PRE|GTP)-UPG-LATEST-.*?)\.(\d+)\.iso"/g) {
    my($file, $version) = ($1, $2);
    print "file=$file version=$version\n";
}
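One possible way to finish the processing is to keep the largest build number seen per prefix in a hash. This is only a sketch: the heredoc stands in for the page that get($upgrade) would return, the pattern is narrowed to the 5.3.0 series from the question, and the %largest hash is a name made up for illustration.

```perl
use strict;
use warnings;

# Stand-in for the page content; in the real script this comes from get($upgrade).
my $website_content = <<'HTML';
<a href="GTP-UPG-LATEST-5.3.0.160.iso">GTP-UPG-LATEST-5.3.0.160.iso</a>
<a href="GTP-UPG-LATEST-5.3.0.172.iso">GTP-UPG-LATEST-5.3.0.172.iso</a>
<a href="GTP-UPG-LATEST-5.3.0.169.iso">GTP-UPG-LATEST-5.3.0.169.iso</a>
<a href="PRE-UPG-LATEST-5.3.0.157.iso">PRE-UPG-LATEST-5.3.0.157.iso</a>
<a href="PRE-UPG-LATEST-5.3.0.168.iso">PRE-UPG-LATEST-5.3.0.168.iso</a>
<a href="PRE-UPG-LATEST-5.3.0.165.iso">PRE-UPG-LATEST-5.3.0.165.iso</a>
HTML

# Hypothetical helper hash: prefix (GTP or PRE) => largest build number so far.
my %largest;
while ($website_content =~ /href="(PRE|GTP)-UPG-LATEST-5\.3\.0\.(\d+)\.iso"/g) {
    my ($prefix, $build) = ($1, $2);
    $largest{$prefix} = $build
        if !defined $largest{$prefix} or $build > $largest{$prefix};
}

my ($gtp, $pre) = @largest{qw(GTP PRE)};
print "gtp=$gtp pre=$pre\n";    # prints "gtp=172 pre=168"
```

Comparing with > only works here because the last component is a plain integer; full dotted versions need a component-wise comparison.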

I would suggest that you use not only LWP::Simple, but XML::Simple too. This will allow you to examine the data as a hash of hashes. It'll be a lot easier to find the largest version.

You can't parse HTML or XML with simple regular expressions because the XML data structure is too free form. Large structures can legally be broken up on separate lines. Take a look at this example:

<a href="http://foo.com/bar/bar/">The Foobar Page</a>

It can also be expressed as:

<a
     href="http://foo.com/bar/bar/">
     The Foobar Page
</a>

If you searched the page line by line for a href, you would never find it. Heck, even the pattern a\s+href fails in that case, because the tag name and the attribute sit on different lines.

There might be better modules to use for parsing HTML (I found HTML::DOM), but I've never used them and don't know which one is the best one to use.
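As one illustration of letting a parser find the links, here is a minimal sketch using HTML::LinkExtor, which ships with the HTML-Parser distribution. That module is my choice for the example, not something recommended above, and the sample markup is made up. Note that the first anchor is split across lines and is still found.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Made-up sample markup; the first <a> tag is deliberately split over two lines.
my $html = <<'HTML';
<table>
<tr><td><a
     href="GTP-UPG-LATEST-5.3.0.160.iso">GTP-UPG-LATEST-5.3.0.160.iso</a></td></tr>
<tr><td><a href="PRE-UPG-LATEST-5.3.0.168.iso">PRE-UPG-LATEST-5.3.0.168.iso</a></td></tr>
</table>
HTML

# With no callback, the parser collects links and returns them from links().
my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;

# Each link is an array ref: [ tag name, attribute => value, ... ].
my @hrefs;
for my $link ($parser->links) {
    my ($tag, %attr) = @$link;
    push @hrefs, "$attr{href}" if $tag eq 'a' and defined $attr{href};
}

print "$_\n" for @hrefs;
```

The collected href values can then be matched against a filename pattern without worrying about how the tags were laid out in the page.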

As for finding the largest version number:

There's some difficulty because there are all sorts of strange and wacky rules with version numbering. For example:

2.2 < 2.10
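To see why this trips up the built-in operators, here is a small sketch that compares the two strings numerically, as strings, and by their second component. The variable names are just for illustration.

```perl
use strict;
use warnings;

my ($x, $y) = ('2.2', '2.10');

# Numeric comparison treats the strings as the decimals 2.2 and 2.1.
my $numeric = $x <=> $y;

# String comparison goes character by character, and '2' sorts after '1'.
my $string = $x cmp $y;

# Comparing the second components as integers gives the intended order.
my $minor = (split /\./, $x)[1] <=> (split /\./, $y)[1];

print "numeric=$numeric string=$string minor=$minor\n";
# prints "numeric=1 string=1 minor=-1"
```

Both built-in comparisons claim that 2.2 is the bigger one; only the component-wise comparison puts 2.10 after 2.2.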

Perl has something called V-Strings, but rumor has it that they've been deprecated. If this doesn't concern you, you can use Perl::Version.
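Another option, which I am adding as an assumption rather than something suggested above, is the core version module. The sketch below only uses strings with two or more dots, which the module reads as dotted versions; a string with a single dot such as 2.10 is read as a decimal number instead, so it would not help with the example above.

```perl
use strict;
use warnings;
use version;

# Version strings taken from the file names in the question.
my @versions = qw(5.3.0.160 5.3.0.172 5.3.0.169);

# version objects overload <=>, so sort can order them component by component.
my ($newest) = sort { version->parse($b) <=> version->parse($a) } @versions;
print "newest=$newest\n";

# A shorter version with a larger component still wins over a longer one.
my $longer_is_older = version->parse('10.3.1.3') < version->parse('10.3.2');
print "10.3.1.3 is older than 10.3.2\n" if $longer_is_older;
```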

Otherwise, here's a subroutine that does version comparison. Note that I also verify that each section is an integer via the /^\d+$/ regular expression. My subroutine can return four values:

0: Both are the same size
1: First Number is bigger
2: Second Number is bigger
undef: There is something wrong.

Here's the program:

use strict;
use warnings;

my $minVersion  = "10.3.1.3";
my $userVersion = "10.3.2";

# Compare the two version strings component by component

my $result = compare($minVersion, $userVersion);

if (not defined $result) {
    print "Non-version string detected!\n";
}
elsif ($result == 0) {
    print "$minVersion and $userVersion are the same\n";
}
elsif ($result == 1) {
    print "$minVersion is bigger than $userVersion\n";
}
elsif ($result == 2) {
    print "$userVersion is bigger than $minVersion\n";
}
else {
    print "Something is wrong\n";
}


sub compare {
    my $version1 = shift;
    my $version2 = shift;

    # Split each version string into its dot-separated components
    my @versionList1 = split /\./, $version1;
    my @versionList2 = split /\./, $version2;

    while (1) {

        # Shift off the first value for comparison
        # Returns undef if there are no more values to parse

        my $versionCompare1 = shift @versionList1;
        my $versionCompare2 = shift @versionList2;

        # If both are empty, Versions Matched
        if (not defined $versionCompare1 and not defined $versionCompare2) {
            return 0;
        }

        # If $versionCompare1 is empty $version2 is bigger
        if (not defined $versionCompare1) {
            return 2;
        }
        # If $versionCompare2 is empty $version1 is bigger
        if (not defined $versionCompare2) {
            return 1;
        }

        # Make sure both are numeric or else there's an error
        if ($versionCompare1 !~ /^\d+$/ or $versionCompare2 !~ /^\d+$/) {
            return;
        }

        if ($versionCompare1 > $versionCompare2) {
            return 1;
        }
        if ($versionCompare2 > $versionCompare1) {
            return 2;
        }
    }
}
