Dom-Processing with Perl-Mechanize: finalizing a little programme

2023-03-06 23:37 问答作者：

I'm currently working on a little harvester, using this dataset of 2700 foundations. All the data are free to use with no limitations or copyright isues.

What I have so far: The harvesting task should be no problem if I take WWW::Mechanize — particularly for doing the form based search and selecting the individual entries. Hmm — I guess that the algorithm would be basically two nested loops: the outer loop runs the form-based search, the inner loop processes the search results.

The outer loop would use the select() and the submit_form() functions on the second search form on the page. Can we use DOM processing here? Well — how can we get the get the selection values.

The inner loop through the results would use the follow link function to get to the actual entries using the following call.

$mech->follow_link(url_regex => qr/webgrab_path=http:\/\/evs2000.*\?
Id=\d+$/, n => $result_nbr);

This would forward our mechanic browser to the entry page. Basically the URL query looks for links that have the webgrap_path to Id pattern, which is unique for each database entry. The $result_nbr variable tells mecha which one of the results it should follow next.

If we have several result pages we would also use the same trick to traverse through the result pages. For the semantic extraction of the entry information,we could parse the content of the actual entries with XML:LibXML's html parser (which works fine on this page), because it gives you some powerful DOM selection (using XPath) methods. Well the actual looping through the pages should be doable in a few lines of Perl (max. 20 lines — likely less).

But wait: the processing of the entry pages will then be the most complex part of the script.

Approaches: In principle we could do the same algorithm with a single while loop if we use the back() function smartly.

Can you give me a hint for the beginning — the processing of the en开发者_运维问答try pages — doing this in Perl:: Mechanize?

Here's what I have:

 GetThePage(
    starting url 
);
sub GetThePage {
    my $mech ...
    my @pages = ...
    while(@pages) {
       my $page = shift @pages;
       $mech->get( $page );
       push @pages, GetMorePages( $mech );
       SomethingImportant( $mech );
       SomethingXPATH( $mech );
    }
}

The question is how to find the DOM-paths.

Use Firebug, Opera Dragonfly, Chromium Developer tools.

Dom-Processing with Perl-Mechanize: finalizing a little programme

Call the context menu on the indicated element to copy an XPath expression or CSS selector (useful for Web::Query) to clipboard.

Really you want to use Web::Scraper for this kind of thing.

继续阅读：dom mechanize parsing perl relative-path

Dom-Processing with Perl-Mechanize: finalizing a little programme

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？