web scraping search results
I need help solving the following issue:
I need to validate cached URLs by Google search engine for a particular site. In the case the url will 404 or the page will not render some necessary html elements (considered broken) I need to log those URLs and later 301 redirect to correct URLs. I know PHP and a little 开发者_开发问答bit of Python but I'm not sure what approach to use to scrap all URLs from search engine results for given site.
http://simplehtmldom.sourceforge.net/ - a simple html parser. there is an example at this page; not sure if this still works with googles instant search etc.
精彩评论