Ruby, Scrape page written entirely in JavaScript
I am playing with Ruby + Hpricot and building a simple scraper. It works on other sites with no issues. But if a page is written entirely in JavaScript, can it be scraped? Google search results pages, for example, now seem to be entirely JavaScript-based except for a few internal links.
Can pages written like this not be scraped by regular tools like Mechanize and Hpricot? (My guess is they can't.)
Are there tools/gems available that try to render the page (like a browser) and then collect the data?
Thanks!
Edit: Thanks for your responses. I realize scraping Google directly is not right; there is an API in place, and that can be used. What I really wanted to find out was this: if a page is written entirely in JavaScript (including its text and contents, which could be obfuscated), is there a gem that will try to render the page as plain text and then extract its text contents?
This is very, very important, so listen carefully:
Always check 'robots.txt', first, and don't scrape if it tells you not to!
If you look at http://www.google.com/robots.txt, you will clearly see this line:
Disallow: /search
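A rough Ruby sketch of such a check follows. It is a simplification, not a full robots.txt parser: it only reads Disallow rules in groups addressed to all user agents ("*"), matches them as plain path prefixes, and ignores Allow rules and wildcards. The helper names (disallowed_prefixes, allowed?) and the SAMPLE_ROBOTS text are made up for illustration; the sample only mirrors the Disallow line quoted above.

```ruby
# Collect the Disallow prefixes from groups addressed to every user agent ("*").
# Simplification: Allow rules and wildcard patterns are ignored.
def disallowed_prefixes(robots_txt)
  prefixes = []
  applies = false          # does the current group apply to "*"?
  in_agent_lines = false   # are we inside a run of User-agent lines?

  robots_txt.each_line do |line|
    line = line.sub(/#.*/, "").strip      # drop comments and whitespace
    next if line.empty?

    field, value = line.split(":", 2).map(&:strip)
    next if value.nil?                    # not a "field: value" line

    case field.downcase
    when "user-agent"
      # Consecutive User-agent lines belong to the same group.
      applies = false unless in_agent_lines
      applies ||= (value == "*")
      in_agent_lines = true
    when "disallow"
      in_agent_lines = false
      prefixes << value if applies && !value.empty?
    else
      in_agent_lines = false
    end
  end

  prefixes
end

# True when no collected Disallow prefix matches the start of the path.
def allowed?(robots_txt, path)
  disallowed_prefixes(robots_txt).none? { |prefix| path.start_with?(prefix) }
end

# Hypothetical sample file, used instead of a live download.
SAMPLE_ROBOTS = <<~ROBOTS
  User-agent: *
  Disallow: /search
  Disallow: /private/
ROBOTS

puts allowed?(SAMPLE_ROBOTS, "/search?q=ruby")
puts allowed?(SAMPLE_ROBOTS, "/about")
```

To check a real site, the text could be downloaded first, for example with Net::HTTP.get(URI("https://example.com/robots.txt")) after require "net/http", and then passed to allowed? before any page is requested.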
Edit (based on asker's comments)
Setting aside the 'robots.txt' issue for a moment, you are probably much better off learning on a simpler website anyway. I'd suggest using a website or two that doesn't change often, so you can easily reproduce your results and verify that everything is working as you expect it to.
Here's a link for you that turns off instant loading.
http://www.google.com/webhp?hl=en&tab=ww&complete=0
- Are there tools/gems available that try to render the page (like a browser) and then collect the data?
You can use PhantomJS (C++) or PyPhantomJS (Python) for screen scraping if you want.
PyPhantomJS also has a really nice plugin system which the C++ one doesn't.
There's also a scraping library that someone just released for it.
Google Groups post | GitHub address
Note: As others have said though, Google doesn't want people to scrape their search results. I suggest complying with their Terms of Service.
You should have a look at Google's TOS. Scraping their search results is not allowed. Use their search API.
If you scrape Google, you absolutely must use proxies, at least 100 of them. Otherwise they'll easily ban your IP address.