How to retrieve google pages
Dear all,I am now using a webtool
http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=
to parse a webpage.
For example,we can parse newyork开发者_C百科times homepage,we do:
http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=http://www.nytimes.com/pages/world/index.html
in the address bar of our browser,it will parse things nicely for us.
However,it just fails for google pages. For example,if I want to parse Google news headpage,like:
http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=http://news.google.com/nwshp?hl=en&tab=wn
I will always get 500 Internal Server Error.
I am sure that is somthing to do with google website,I think probably we need some API for google,does anyone have any idea how to to sort this out for google pages? Many thanks.
Per the google.com robots.txt file, you are explictly requested not to scrape their content. Google does not provide an API for machine-readable search results; they want to control the presentation of their content via widgets and embedding strategies.
精彩评论