Writing a pseudo-crawler for web statistics
I'm tasked with writing a web pseudo-crawler to calculate certain statistics. I need to measure the percentage of HTML files that start with <!DOCTYPE
against the number of HTML files that do not have it, and compare this statistic between sites on different subjects. To do so, the idea is to search Google for different terms (like "Automobile", "Stock exchange", "Liposuction"...) and request the first 300 or so pages found.
I want the process to be very fast, yet I do not want to be banned by Google. I also want to minimize development time where possible; maybe some quick-and-dirty Perl script would do.
Is there any ready-made solution that I can and should reuse? Searching Google, I did not find anything suitable, because what I want to measure is not part of HTML itself, yet it resides in HTML files.
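For the statistic itself, what I have in mind is something like the following rough Perl sketch (assuming the page URLs have already been collected into a urls.txt file, one per line; the file name and counting logic are just illustrative):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # Assumed input: urls.txt with one URL per line, collected from the search results.
    my $ua = LWP::UserAgent->new(timeout => 10, agent => 'doctype-stats/0.1');

    my ($with_doctype, $without_doctype) = (0, 0);

    open my $fh, '<', 'urls.txt' or die "Cannot open urls.txt: $!";
    while (my $url = <$fh>) {
        chomp $url;
        next unless $url;

        my $resp = $ua->get($url);
        next unless $resp->is_success;

        my $html = $resp->decoded_content;
        # Count pages whose first non-whitespace text is a DOCTYPE declaration.
        if ($html =~ /^\s*<!DOCTYPE/i) {
            $with_doctype++;
        } else {
            $without_doctype++;
        }

        sleep 1;    # crude rate limiting to stay polite
    }
    close $fh;

    my $total = $with_doctype + $without_doctype;
    printf "With DOCTYPE: %d/%d (%.1f%%)\n", $with_doctype, $total,
        $total ? 100 * $with_doctype / $total : 0;

But I would rather reuse something existing if it fits.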
wget can do just about everything, including limiting your request rate.
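For example, a minimal invocation might look like this (assuming the URLs gathered from the search results have been saved to urls.txt; the file and directory names are just for illustration):

    wget --wait=2 --random-wait --timeout=10 --tries=1 \
         --input-file=urls.txt --directory-prefix=pages/

You can then run whatever counting script you like over the downloaded files in pages/.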
HTTrack is also pretty good and easy to use. It has a nice GUI and a lot of options, and the source is available if you're looking for inspiration.