
Writing pseudo-crawler for web statistics

I'm tasked with writing a web pseudo-crawler to calculate certain statistics. I need to measure the percentage of HTML files that start with <!DOCTYPE against the number of HTML files that do not have it, and compare this statistic between sites on different subjects. The idea is to search Google for different terms (like "Automobile", "Stock exchange", "Liposuction"...) and request the first 300 or so pages found.

I want the process to be fast, yet I do not want to get banned by Google. I also want to minimize development time where possible; maybe some quick-and-dirty Perl script will do.

Is there any ready-made solution that I can and should reuse? Searching Google I did not find anything suitable, because what I want to measure is not part of the HTML markup itself, yet it resides in HTML files.
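
Roughly what I have in mind for the counting step, once the result URLs have been collected into a list (just a sketch using LWP::UserAgent; the user-agent string, the one-second delay, and taking the URLs from the command line are placeholders, not a finished script):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # URLs to check, passed on the command line
    # (assumed to be collected from the search results beforehand)
    my @urls = @ARGV;

    my $ua = LWP::UserAgent->new(
        agent   => 'doctype-stats/0.1',
        timeout => 10,
    );

    my ( $with_doctype, $without_doctype ) = ( 0, 0 );

    for my $url (@urls) {
        my $res = $ua->get($url);
        next unless $res->is_success;

        my $body = $res->decoded_content // '';
        $body =~ s/^\x{FEFF}?\s*//;        # strip a possible BOM and leading whitespace
        if ( $body =~ /^<!DOCTYPE/i ) {    # case-insensitive prefix check
            $with_doctype++;
        }
        else {
            $without_doctype++;
        }
        sleep 1;                           # polite delay between requests
    }

    my $total = $with_doctype + $without_doctype;
    printf "DOCTYPE present: %d/%d (%.1f%%)\n",
        $with_doctype, $total,
        $total ? 100 * $with_doctype / $total : 0;

Fetching and counting is the easy part; the piece I'd rather not build myself is collecting and downloading the result pages quickly without getting banned.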


wget can do just about everything, including limiting your request rate.
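
For example (assuming the result URLs have already been saved to a file urls.txt, one per line), something along these lines downloads them with a polite pause between requests:

    wget --wait=1 --random-wait --timeout=10 --tries=2 -i urls.txt -P pages/

--wait and --random-wait throttle the request rate, -i reads the URL list, and -P puts the downloaded files into the pages/ directory for later counting.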


HTTrack is also pretty good and easy to use. Has a nice GUI and a lot of options.

The source is also available if you're looking for inspiration.
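
There is also a command-line version if you want to script it; a minimal sketch (example.com stands in for the pages you actually want, and -r1 keeps the mirror depth to the listed pages only, as far as I remember):

    httrack "http://example.com/a.html" "http://example.com/b.html" -O ./mirror -r1

-O sets the output directory; the saved files can then be fed to whatever does the DOCTYPE counting.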
