How to extract specific information from an HTML webpage using Perl
If the text "XYZ 81.6 (-0.1)" needs to be extracted from an HTML webpage, how can it be done with Perl? Many thanks.
<table border="0" width="100%">
<caption valign="top">
<p class="InfoContent"><b><br></b>
</caption>
<tr>
<td colspan="3"><p class="InfoContent"><b>ABC</b></td>
</tr>
<tr>
<td valign="top" height="61" width="31%">
<p class="InfoContent"><b><font color="#0000FF">XYZ 81.6 (-0.1) <br>22/06/2011</font></b></p>
</td>
</tr></table>
I would use HTML::TreeBuilder::XPath for this (and yes, it is a shameless plug!):
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
my $t= HTML::TreeBuilder::XPath->new_from_file( shift @ARGV);
my $text= $t->findvalue( '//p[@class="InfoContent"]/b/font[@color="#0000FF"]');
$text=~ s{\).*}{)};
print "found '$text'\n";
It is quite fragile though: as far as I can tell, the only way to narrow down the XPath expression to just what you want is to use the font tag. That is likely to change in the future, so if (when!) the code breaks, that's where you'll have to look first.
You can use something like this:
bash-3.2$ perl -MLWP::Simple -le ' $current_value = get("http://stackoverflow.com/questions/6454398/how-to-extract-specific-information-from-html-webpage-using-perl"); if ($current_value=~/(XYZ\s\d+\.\d+\s\(.*?\))/s) { print "Matched pattern is:\t $1";} '
Matched pattern is: XYZ 81.6 (-0.1)
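The same idea can be tightened a little and applied to HTML you have already fetched. This is only a minimal sketch: it assumes the label is always the literal XYZ and that the numbers keep the format shown in the question, and extract_quote is a hypothetical helper name, not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (label, value, change), or an empty list
# when nothing matches. Assumes text shaped like "XYZ 81.6 (-0.1)".
sub extract_quote {
    my ($html) = @_;
    if ( $html =~ /\b(XYZ)\s+(-?\d+(?:\.\d+)?)\s+\(([-+]?\d+(?:\.\d+)?)\)/ ) {
        return ( $1, $2, $3 );
    }
    return;
}

# Sample input copied from the question's HTML.
my $html = '<b><font color="#0000FF">XYZ 81.6 (-0.1) <br>22/06/2011</font></b>';
my ( $label, $value, $change ) = extract_quote($html);
print "$label $value ($change)\n" if defined $label;
```

Capturing the value and the change separately means you can use them as numbers later instead of re-parsing the matched string. Like any regex over HTML, it will break as soon as the page wording changes.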
Mirod's answer is awesome. This being Perl, I'll throw another approach out there.
Let's assume you have the HTML file in input.html. Here's a Perl program which uses the HTML::TreeBuilder module to extract the text:
#!/usr/bin/perl
use 5.10.0 ;
use strict ;
use warnings ;
use HTML::TreeBuilder ;
my $tree = HTML::TreeBuilder -> new () ;
$tree -> parse_file ( 'input.html' ) ;
my $text = ($tree -> address ( '0.1.0.2.0.0.0.1' ) -> content_list ()) [0] ;
say $text ;
Running it:
/tmp/tmp $ ./_extract-a.pl
XYZ 81.6 (-0.1)
So how did I come up with that '0.1.0.2.0.0.0.1' magic number? Each node in the tree that results from parsing the HTML file has an "address". The font node that contains the text you are interested in has the address '0.1.0.2.0.0.0.1', and the text itself is that node's first child.
So, how do you display the node addresses? Here's a little program I call treebuilder-dump; when you pass it an HTML file, it displays it with the nodes labeled:
#!/usr/bin/perl
use 5.10.0 ;
use strict ;
use warnings ;
use HTML::TreeBuilder ;
my $tree = HTML::TreeBuilder->new ;
# "! @ARGV == 1" parses as "(! @ARGV) == 1", so compare the count directly
if ( @ARGV != 1 ) { die "Expected exactly one HTML file argument" ; }
if ( ! -f $ARGV[0] ) { die "File does not exist: $ARGV[0]" ; }
$tree->parse_file ( $ARGV[0] ) ;
$tree->dump () ;
$tree->delete () ;
So for example, here's the output when run on your HTML snippet:
<html> @0 (IMPLICIT)
<head> @0.0 (IMPLICIT)
<body> @0.1 (IMPLICIT)
<table border="0" width="100%"> @0.1.0
<caption valign="top"> @0.1.0.0
<p class="InfoContent"> @0.1.0.0.0
<b> @0.1.0.0.0.0
<br /> @0.1.0.0.0.0.0
<tr> @0.1.0.1
<td colspan="3"> @0.1.0.1.0
<p class="InfoContent"> @0.1.0.1.0.0
<b> @0.1.0.1.0.0.0
"ABC"
<tr> @0.1.0.2
<td height="61" valign="top" width="31%"> @0.1.0.2.0
<p class="InfoContent"> @0.1.0.2.0.0
<b> @0.1.0.2.0.0.0
" "
<font color="#0000FF"> @0.1.0.2.0.0.0.1
"XYZ 81.6 (-0.1)"
<br /> @0.1.0.2.0.0.0.1.1
"22/06/2011"
" "
You can see that the text you're interested in is located within the font color node, which has address 0.1.0.2.0.0.0.1.
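A hard-coded address breaks whenever a node is added or removed above the target, so it may be worth locating the node by tag and attribute instead. Here is a rough sketch using HTML::TreeBuilder's look_down method. It assumes the value always sits in a font element with color="#0000FF", and extract_by_font is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: finds the first <font color="#0000FF"> element and
# returns its leading text, or undef if there is no such element.
sub extract_by_font {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $font = $tree->look_down( _tag => 'font', color => '#0000FF' );
    my $text;
    if ($font) {
        # The first child of the font node is the text before the <br>.
        ($text) = $font->content_list;
        # Keep only plain text, and trim trailing whitespace.
        if ( defined $text && !ref $text ) {
            $text =~ s/\s+$//;
        }
        else {
            undef $text;
        }
    }
    $tree->delete;
    return $text;
}

# Sample input copied from the question's HTML.
my $html = <<'HTML';
<table border="0" width="100%">
<tr>
<td valign="top" height="61" width="31%">
<p class="InfoContent"><b><font color="#0000FF">XYZ 81.6 (-0.1) <br>22/06/2011</font></b></p>
</td>
</tr></table>
HTML

my $text = extract_by_font($html);
print defined $text ? "$text\n" : "not found\n";
```

This still depends on the font color, so it is about as fragile as the XPath version above, but it survives changes to the surrounding table layout.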