fetch pages [LWP] parse them [HTML::TokeParser] and store results [DBI]

2023-01-20 02:57 Q&A author:

A triple job: I have to do a job that consists of three tasks:

Fetch pages
Parse HTML
Store data

And yes, this is a true Perl job!

I have to run a parser job on all 6000 sub-pages of a site in Switzerland (a governmental site, which has very good servers).

see http://www.educa.ch/dyn/79362.asp?action=search and

(if you do not see approx 6000 results - then do a search with .

A detail page looks like this:

[link text][1]

Ecole nouvelle de la Suisse Romande Ch. de Rovéréaz 20 Case postal 161 1000 Lausanne 12 Website info@ensr.ch Tel:021 654 65 00 Fax:021 654 65 05

Another detail page shows this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><meta name="generator" content="DigiOnline GmbH - WebWeaver 3.4 CMS - "><title>educa.ch</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><link rel="stylesheet" href="101.htm"><script src="102.htm"></script><script language="JavaScript"><!--
var did='d79376';
var root=new Array('d200','d205','d73137','d1566','d79376','d');
var usefocus = 1;
function check() {
if ((self.focus) && (usefocus)) {
self.focus();
}
}
// --></script></head><body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onload="check();"><table cellspacing="0" cellpadding="0" border="0" width="100%"><tr><td width="15" class="popuphead"><img src="/0.gif" alt="" width="15" height="16"></td><td width="99%" class="popuphead">Adresse - Schulen in der Schweiz</td><td width="20" class="popuphead" valign="middle"><a href="#" title="Print" onclick="window.print(); return false;"><img src="../pics/print16x13.gif" alt="Drucken" width="16" height="13"></a></td><td width="20" class="popuphead" valign="middle"><a href="#" title="close" onclick="window.close(); return false;"><img src="../pics/close21x13.gif" alt="Schliessen" width="21" height="13"></a></td></tr><tr bgcolor="#B2B2B2"><td colspan="4"><img src="/0.gif" alt="" width="1" height="1"></td></tr></table><div class="leerzeile">&#160;</div><div class="leerzeile"><img src="/0.gif" alt="" width="15" height="8">Auseklis - Schule für lettische Sprache und Kultur</div><div class="leerzeile">&#160;</div><div><img src="/0.gif" alt="" width="15" height="8">Mutschellenstrasse 37</div><div><img src="/0.gif" alt="" width="15" height="8"></div><div><img src="/0.gif" alt="" width="15" height="8">8002&#160;Zürich</div><div class="leerzeile">&#160;</div><div><img src="/0.gif" alt="" width="15" height="8"><a href="http://latvia.yourworld.ch" target="_blank">latvia.yourworld.ch</a></div><div><img src="/0.gif" alt="" width="15" height="8"><a href="mailto: schorderet@inbox.lv">schorderet@inbox.lv</a></div><div class="leerzeile">&#160;</div><div><img src="/0.gif" alt="" width="15" height="8">Tel:<img src="/0.gif" alt="" width="6" height="8">+41786488637</div><div><img src="/0.gif" alt="" width="15" height="8">Fax:<img src="/0.gif" alt="" width="4" height="8"></div><div>&#160;</div></body></html>

I want to do this job with **HTML::TokeParser** or *HTML::TreeBuilder::LibXML*, but I have little experience with HTML::TreeBuilder::LibXML.

Which one would you prefer for this job? Note: I want to store the results in a MySQL database. It would be best to store them immediately after parsing.

So we have three tasks:

Fetch pages
Parse HTML
Store data

First item: Use LWP::UserAgent to fetch. There are many examples in this forum of using that module to post data and get the resulting pages. BTW, we can use WWW::Mechanize instead if we prefer.

Second: Parse the page, e.g. with HTML::TokeParser or some other module, to get at only the data we need.

Third: Store the data straight away into a database. There is no need to take an intermediate step and write a temporary file.
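
As a rough illustration of the third step, rows can go into the database through DBI as soon as they are parsed. This is only a sketch under assumptions: an in-memory SQLite database (DBD::SQLite) stands in for MySQL so the code runs without a server, and the table name, column names and the helper `store_school` are made up for illustration.

```perl
use strict;
use warnings;
use DBI;

# Assumption: SQLite in memory stands in for MySQL. For MySQL the DSN would
# look like "dbi:mysql:database=schools;host=localhost" (hypothetical name),
# followed by a user name and password.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

# Hypothetical table layout; adjust the columns to the fields you extract.
$dbh->do('CREATE TABLE school (name TEXT, city TEXT, phone TEXT)');

# Prepare once, execute once per page; placeholders take care of quoting.
my $insert = $dbh->prepare(
    'INSERT INTO school (name, city, phone) VALUES (?, ?, ?)');

# Hypothetical helper: store one parsed record straight away.
sub store_school {
    my (%rec) = @_;
    $insert->execute( @rec{qw(name city phone)} );
}

store_school(
    name  => 'Ecole nouvelle de la Suisse Romande',
    city  => '1000 Lausanne 12',
    phone => '021 654 65 00',
);
```

Preparing the statement once and calling execute for each page avoids both a temporary file and hand-built SQL strings.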

Hmm, that leaves the first and the second question: how to fetch and how to parse.

Hard to be too specific as your question is very general. I've retrieved pages using LWP and used TokeParser to extract data and store the output in a database many times. I haven't used Mech, but by all accounts it is simpler than LWP.

Creating a user agent using LWP can be as simple as:

my $ua = LWP::UserAgent->new();

You will need to consider things like redirects, proxies, and cookies or passwords, depending on your requirements.

To follow re-directs:

$ua = LWP::UserAgent->new(
    requests_redirectable =>   ['GET', 'HEAD', 'POST' ]
);

To store cookies:

$ua->cookie_jar( {} );

To set up a proxy:

$ua->proxy("http", "http://localhost:8888");  # Fiddler

To add a password for authentication:

$ua->credentials( 'www.myhostingplace.com:443' , 'Realm' , 'userid', 'password');

To get content from a page for local processing:

my $url      = 'http://www.someurl.com';
my $response = $ua->get($url);
if ( $response->is_error() ) {
   # Do some error stuff
}
my $content = $response->content();
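
For a job that covers several thousand pages, the get call is probably worth wrapping in a small helper. The following is a sketch, not a finished crawler: the helper name `fetch_page`, the 30 second timeout and the one second pause are assumptions, and the list of detail-page URLs is left out because the question does not show how they are built.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new( timeout => 30 );

# Hypothetical helper: return the page body, or nothing after logging a failure.
sub fetch_page {
    my ($url) = @_;
    my $response = $ua->get($url);
    unless ( $response->is_success ) {
        warn "Failed to fetch $url: ", $response->status_line, "\n";
        return;
    }
    # decoded_content applies the charset declared in the response headers
    return $response->decoded_content;
}

# Usage sketch: @urls would hold the detail-page addresses (not shown here).
# for my $url (@urls) {
#     my $html = fetch_page($url) or next;
#     # ... parse and store ...
#     sleep 1;    # pause between requests to go easy on the server
# }
```

Skipping a failed page and carrying on keeps one bad response from aborting the whole run.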

To parse the content using TokeParser:

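my $stream = HTML::TokeParser->new(\$content);

while ( my $t = $stream->get_token() ) {
   # $t->[0] is the token type ('S' = start tag), $t->[1] the tag name
   if ( $t->[0] eq 'S' and $t->[1] eq 'input' ) {
      # $t->[2] is a hash reference holding the tag's attributes
      if ( uc( $t->[2]{ 'name' } // '' ) eq 'SEARCHVALUE' ) {
           my $data = $t->[2]{ 'value' };
           # Do something with data
      }
   }
}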

The data is passed into TokeParser as a reference; I then walk through the stream using get_token. Each token is returned as an array reference which you can examine to determine what you should do next.

In the above example I want to search for input tags with an attribute name of 'SEARCHVALUE' and then store the 'value' attribute. The HTML fragment might look something like this:

<input type="hidden" name="SEARCHVALUE" value="Spock" />

When I hit the start of the input tag ($t->[0] eq 'S' and $t->[1] eq 'input') I examine the "name" attribute of the tag ($t->[2]{ 'name' }) to see if it matches the value I am searching for; if it does, I store the value attribute of the tag ($t->[2]{ 'value' }) in a variable. I can then do whatever I like with the value, including storing it in a database.
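
The same kind of token walk can be pointed at the detail page from the question. This is a minimal sketch that assumes every field sits in its own div, as in the sample page; the helper name `extract_fields` is made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the non-blank text of each <div>, in page order.
sub extract_fields {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        or die "Cannot parse HTML";
    my @fields;
    while ( $stream->get_tag('div') ) {
        # Read everything up to the closing </div>; entities are decoded.
        my $text = $stream->get_text('/div');
        $text =~ s/\xA0/ /g;       # turn non-breaking spaces into plain spaces
        $text =~ s/^\s+|\s+$//g;   # trim both ends
        $text =~ s/\s+/ /g;        # collapse runs of whitespace
        push @fields, $text if length $text;
    }
    return @fields;
}
```

On the sample page this should give one string per non-blank line (school name, street, postal code and town, and so on). Labels such as Tel: stay attached to their values, so some post-processing is still needed before storing.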

You can do a lot with TokeParser, and in some cases it can be simpler than using regular expressions to carve up the page, but it can also be a little challenging to get your head around. If you are trying to extract a simple pattern from the returned HTML content, then a regular expression can be just as good.
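
To illustrate the regular expression route, here is a small sketch that pulls the e-mail address out of a mailto link like the one in the sample page. The helper name `first_email` is hypothetical, and the pattern assumes the href attribute is written with double quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first mailto: address found in raw HTML.
sub first_email {
    my ($html) = @_;
    # Allow optional whitespace after "mailto:", as in the sample page.
    return $html =~ /href="mailto:\s*([^"\s]+)"/i ? $1 : undef;
}

my $html = '<a href="mailto: schorderet@inbox.lv">schorderet@inbox.lv</a>';
print first_email($html), "\n";    # prints schorderet@inbox.lv
```

This works for one well-known pattern; once several fields or nested tags are involved, the tokenizer approach tends to be less fragile.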

If you have a lot of this to do then I recommend "Perl and LWP" by Sean Burke from O'Reilly. It has been endlessly helpful for me in my web scraping endeavours.

Hope this helps you get started at least.
