
Perl::Mechanize: running a simple crawler with a loop [multiple queries]

I am currently working out a way to parse the data of this page: http://www.foundationfinder.ch/

I would love to do it in Perl. Well, I am just musing about which is the best way to do the job. I guess I am in front of a nice learning curve; this task will give me some nice Perl lessons. At the moment it goes a bit over my head... ;-)

So here are some sample pages. Since I think all 790 result pages can be found within a certain range between Id=0 and Id=100000, I thought I could do the job with a loop:

http://www.foundationfinder.ch/ShowDetails.php?Id=11233&InterfaceLanguage=&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=927&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=949&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=20011&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=10579&InterfaceLanguage=1&Type=Html

I thought I could go the Perl way, but I am not very sure: I was trying to use LWP::UserAgent on the same URLs [see below] with different query arguments, and I am wondering whether LWP::UserAgent provides a way to loop through the query arguments. I am not sure that LWP::UserAgent has a method for that. Well, I have sometimes heard that it is easier to use Mechanize. But is it really easier!?
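For what it is worth, here is a minimal sketch of how such a loop could look with plain LWP::UserAgent, building each URL with the URI module (the Id range and the query values are just assumptions taken from the sample URLs above):

use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $ua = LWP::UserAgent->new( timeout => 10 );

for my $id ( 0 .. 100_000 ) {
    # build the URL for this Id; URI takes care of the query string
    my $uri = URI->new('http://www.foundationfinder.ch/ShowDetails.php');
    $uri->query_form( Id => $id, InterfaceLanguage => 1, Type => 'Html' );

    my $response = $ua->get($uri);
    next unless $response->is_success;    # skip Ids that do not resolve

    my $html = $response->decoded_content;
    # ... parse $html here ...
}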

BTW: if I were going the PHP way, I could do it with cURL, couldn't I!?

Here is my approach: I tried to figure it out and dug deeper into the manpages and howtos. We can have a loop constructing the URLs and call cURL repeatedly.

As noted above, here we have some result pages:

http://www.foundationfinder.ch/ShowDetails.php?Id=11233&InterfaceLanguage=&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=927&InterfaceLanguage=1&Type=Html

Alternatively, we can add a request_prepare handler that computes and adds the query arguments before we send out the request.
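A minimal sketch of that idea, using LWP::UserAgent's add_handler and the same query arguments as above (the Id to fetch is kept in a variable that the loop updates):

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
my $current_id;    # updated by the loop before each request

# request_prepare runs just before a request goes out,
# so we can compute and add the query arguments here
$ua->add_handler( request_prepare => sub {
    my ( $request, $ua, $handler ) = @_;
    $request->uri->query_form(
        Id                => $current_id,
        InterfaceLanguage => 1,
        Type              => 'Html',
    );
    return;
});

for my $id ( 0 .. 100_000 ) {
    $current_id = $id;
    my $response = $ua->get('http://www.foundationfinder.ch/ShowDetails.php');
    # process $response here
}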

Again, the aim: I want to parse the data and afterwards store it in a local MySQL database.
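The storing step could then look roughly like this with DBI and DBD::mysql; the database name, credentials and columns below are pure placeholders of mine, not anything taken from the site:

use DBI;

# placeholder connection data - adjust to the local setup
my $dbh = DBI->connect(
    'DBI:mysql:database=foundations;host=localhost',
    'user', 'password',
    { RaiseError => 1 },
);

my $sth = $dbh->prepare(
    'INSERT INTO foundation (id, name, address) VALUES (?, ?, ?)'
);

# $id, $name and $address would come out of the parsing step
my ( $id, $name, $address ) = ( 11233, 'Some Foundation', 'Somewhere 1' );
$sth->execute( $id, $name, $address );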

Should I define an extern_uid!?

and go like this:

for my $i (0..10000) {
  $ua->get('http://www.foundationfinder.ch/ShowDetails.php?Id=', id => 21, extern_uid => $i);
  # process reply
}

Well, but now I am stuck. I need help: can I do the job like this!?

regards

zero


Don't do it like this. Use Live HTTP Headers (a Firefox plugin) or an equivalent to see what the JavaScript does behind the scenes while you select what you need, in order to get to that page (with the table).

To get the data from the table, use HTML::TableExtract, or HTML::TreeBuilder::XPath if you want to use XPath.
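A minimal HTML::TableExtract sketch, for example (the depth and count values are guesses and have to be matched against the real page):

use HTML::TableExtract;

# depth/count select one particular table on the page;
# inspect the real HTML to find the right values
my $te = HTML::TableExtract->new( depth => 0, count => 0 );
$te->parse($html);    # $html is the fetched page content

for my $ts ( $te->tables ) {
    for my $row ( $ts->rows ) {
        print join( ' | ', map { defined $_ ? $_ : '' } @$row ), "\n";
    }
}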

If you do want to iterate over the queries, just create another variable:

my $url = 'http://www.foundationfinder.ch/ShowDetails.php?Id=' . $q . '&InterfaceLanguage=&Type=Html';

and increment $q as you go; make sure the page is valid before trying to load it with get.
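Put together, a sketch with WWW::Mechanize could look like this; success() stands in for "make sure the page is valid", and the Id range is again just an assumption:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 0 );   # don't die on bad Ids

for my $q ( 0 .. 100_000 ) {
    my $url = 'http://www.foundationfinder.ch/ShowDetails.php?Id=' . $q
            . '&InterfaceLanguage=&Type=Html';

    $mech->get($url);
    next unless $mech->success;    # page not valid - skip this Id

    my $html = $mech->content;
    # ... hand $html to HTML::TableExtract and store the rows ...
}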

