Dom-Processing with Perl-Mechanize: finalizing a little programme

I'm currently working on a little harvester, using this dataset of 2700 foundations. All the data are free to use, with no limitations or copyright issues.

What I have so far: the harvesting task should be no problem with WWW::Mechanize, particularly for running the form-based search and selecting the individual entries. I guess the algorithm would basically be two nested loops: the outer loop runs the form-based search, the inner loop processes the search results.

The outer loop would use the select() and submit_form() methods on the second search form on the page. Can we use DOM processing here? And how can we get the selection values?
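One way to read the selection values is to ask the form object itself. The following is a minimal sketch, assuming a made-up form with a select field named `region`; the field name, its options and the base URL are invented, and the real ones have to be taken from the page source. It parses an inline HTML string with HTML::Form, the class WWW::Mechanize uses for its form objects.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical stand-in for the second search form on the page;
# the field name "region" and its options are invented for illustration.
my $html = <<'HTML';
<form action="/search" method="get">
  <select name="region">
    <option value="">all</option>
    <option value="north">North</option>
    <option value="south">South</option>
  </select>
  <input type="submit" value="Search">
</form>
HTML

# HTML::Form->parse needs a base URI to resolve the form action against.
my ($form) = HTML::Form->parse($html, 'http://example.invalid/');

# possible_values lists every value the select can submit;
# the empty "all" entry is filtered out here.
my $select = $form->find_input('region');
my @values = grep { defined && length } $select->possible_values;

print "$_\n" for @values;
```

With a live Mechanize object, `$mech->form_number(2)` should return the same kind of form object, and each value could then be passed to `$mech->submit_form(form_number => 2, fields => { region => $value })` inside the outer loop.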

The inner loop through the results would use the follow_link() method to get to the actual entries, using the following call:

$mech->follow_link(url_regex => qr/webgrab_path=http:\/\/evs2000.*\?Id=\d+$/, n => $result_nbr);

This would forward our mechanized browser to the entry page. Basically, the URL regex looks for links that match the webgrab_path ... Id pattern, which is unique for each database entry. The $result_nbr variable tells Mechanize which of the results it should follow next.
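Before running against the site, the pattern can be tried on a few sample URLs to see which links it would select. This is only a sketch: the hrefs below are invented, and only the pattern itself comes from the follow_link call above.

```perl
use strict;
use warnings;

# The pattern from the follow_link call, written with alternate delimiters
# so the slashes need no escaping.
my $entry_re = qr{webgrab_path=http://evs2000.*\?Id=\d+$};

# Hypothetical hrefs, invented for illustration only.
my @hrefs = (
    'index.php?webgrab_path=http://evs2000.example/detail.php?Id=17',
    'index.php?webgrab_path=http://evs2000.example/list.php?page=2',
    'about.html',
);

# Keep only the links that look like database entries.
my @entries = grep { $_ =~ $entry_re } @hrefs;
print "$_\n" for @entries;
```

On a live page, `$mech->find_all_links(url_regex => $entry_re)` returns all matching links at once, so the number of results is known before the inner loop starts.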

If we have several result pages, we would use the same trick to traverse them. For the semantic extraction of the entry information, we could parse the content of the actual entries with XML::LibXML's HTML parser (which works fine on this page), because it gives you powerful DOM selection methods using XPath. The actual looping through the pages should be doable in a few lines of Perl (20 lines at most, likely fewer).
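The XPath extraction could look roughly like this. It is a sketch on invented markup: the table id, labels and values are hypothetical, and the real entry page has to be inspected to find the right paths.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical entry page; the real markup must be inspected first.
my $html = <<'HTML';
<html><body>
  <table id="entry">
    <tr><th>Name</th><td>Example Foundation</td></tr>
    <tr><th>City</th><td>Berlin</td></tr>
  </table>
</body></html>
HTML

# recover => 2 keeps parsing through malformed HTML without printing warnings.
my $dom = XML::LibXML->load_html(string => $html, recover => 2);

# Collect each label/value row of the table into a hash.
my %entry;
for my $row ($dom->findnodes('//table[@id="entry"]//tr')) {
    my $label = $row->findvalue('./th');
    my $value = $row->findvalue('./td');
    $entry{$label} = $value;
}

print "$_: $entry{$_}\n" for sort keys %entry;
```

In the real script the string would come from `$mech->content` instead of a heredoc. The `//tr` step is deliberate: browsers insert `tbody` elements that are not in the page source, so a path that names every intermediate element may not match what the parser sees.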

But wait: the processing of the entry pages will then be the most complex part of the script.

Approaches: In principle we could do the same algorithm with a single while loop if we use the back() function smartly.

Can you give me a hint for the beginning, the processing of the entry pages, doing this with WWW::Mechanize in Perl?

Here's what I have:

GetThePage(
    starting url
);

sub GetThePage {
    my $mech ...
    my @pages = ...
    while (@pages) {
        my $page = shift @pages;             # take the next URL off the queue
        $mech->get($page);
        push @pages, GetMorePages($mech);    # queue further pages found on this one
        SomethingImportant($mech);
        SomethingXPATH($mech);               # XPath extraction on the current page
    }
}

The question is how to find the DOM-paths.


Use Firebug, Opera Dragonfly, or the Chromium Developer Tools.

Call the context menu on the indicated element to copy an XPath expression or CSS selector (useful for Web::Query) to the clipboard.


Really you want to use Web::Scraper for this kind of thing.
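A minimal Web::Scraper sketch, to show the shape of it. The markup and the `hit` class name are invented for illustration; the selectors for the real result list have to be read off the page.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical result list; the class name "hit" is invented for illustration.
my $html = <<'HTML';
<html><body>
  <ul>
    <li class="hit"><a href="/entry?Id=1">First Foundation</a></li>
    <li class="hit"><a href="/entry?Id=2">Second Foundation</a></li>
  </ul>
</body></html>
HTML

# Each li.hit becomes one hash holding the link text and its href.
my $results = scraper {
    process 'li.hit', 'hits[]' => scraper {
        process 'a', name => 'TEXT';
        process 'a', url  => '@href';
    };
};

my $data = $results->scrape($html);
print "$_->{name} -> $_->{url}\n" for @{ $data->{hits} };
```

The scraper block is declarative: the CSS selectors say what to pick, and `scrape` returns a plain data structure. It also accepts a URI object or an HTTP::Response, so it can be fed directly from a fetched page.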
