
Ruby: scraping a page written entirely in JavaScript

I am playing with Ruby + Hpricot and building a simple scraper. Other sites work with no issues. But can a page that is written entirely in JavaScript be scraped? Google search results pages, for example, now seem to be entirely JavaScript-based except for a few internal links.

  • Can pages written like this really not be scraped by regular tools like Mechanize & Hpricot? (My guess is they can't.)

  • Are there tools/gems available that try to render the page (like a browser) and then collect the data?

Thanks!

Edit: Thanks for your responses. I realize that scraping Google directly is not right; there is an API in place, and that can be used. What I really wanted to find out is this: if a page is written entirely in JavaScript (including its text and contents, possibly obfuscated), is there a gem that will try to render the page as plain text and then extract its text contents?


This is very, very important, so listen carefully:

Always check 'robots.txt' first, and don't scrape if it tells you not to!

If you look at http://www.google.com/robots.txt, you will clearly see this line:

Disallow: /search
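The check described above can be sketched in Ruby. This is a minimal illustration, not a complete robots.txt parser: it only reads `Disallow` rules from `User-agent: *` groups and matches them as plain path prefixes, ignoring `Allow` lines and wildcards. The helper names `disallowed_prefixes` and `allowed?` are made up for this example, and the sample text is hypothetical apart from the `/search` rule quoted above.

```ruby
# Collect the Disallow path prefixes that apply to all crawlers
# (the "User-agent: *" groups). Simplified on purpose: no Allow rules,
# no wildcards, no per-bot groups.
def disallowed_prefixes(robots_txt)
  prefixes = []
  applies = false         # does the current group apply to "*"?
  reading_agents = false  # true while reading consecutive User-agent lines

  robots_txt.each_line do |raw|
    line = raw.sub(/#.*/, "").strip   # drop comments and whitespace
    next if line.empty?

    field, value = line.split(":", 2).map(&:strip)
    next if value.nil?                # not a "field: value" line

    case field.downcase
    when "user-agent"
      # A User-agent line after a rule line starts a new group.
      applies = false unless reading_agents
      applies = true if value == "*"
      reading_agents = true
    when "disallow"
      reading_agents = false
      # An empty Disallow value means "nothing is disallowed".
      prefixes << value if applies && !value.empty?
    else
      reading_agents = false
    end
  end

  prefixes
end

# True if no collected Disallow prefix matches the start of the path.
def allowed?(robots_txt, path)
  disallowed_prefixes(robots_txt).none? { |prefix| path.start_with?(prefix) }
end

# Hypothetical sample; only the /search rule comes from the answer above.
SAMPLE_ROBOTS = <<~ROBOTS
  User-agent: *
  Disallow: /search
ROBOTS

puts allowed?(SAMPLE_ROBOTS, "/search?q=ruby")  # false
puts allowed?(SAMPLE_ROBOTS, "/about")          # true
```

In a real scraper the text would be downloaded first, for example with `Net::HTTP.get(URI("http://www.google.com/robots.txt"))` from Ruby's standard library, and the check would run before every request.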

Edit (based on asker's comments)

Setting aside the 'robots.txt' issue for a moment, you are probably much better off learning on a simpler website anyway. I'd suggest picking a website or two that don't change often, so you can easily reproduce your results and verify that everything is working as you expect it to.


Here's a link for you that turns off instant loading.
http://www.google.com/webhp?hl=en&tab=ww&complete=0

  • Are there tools/gems available that try to render the page (like a browser) and then collect the data?

You can use PhantomJS (C++) or PyPhantomJS (Python) for screen scraping if you want.

PyPhantomJS also has a really nice plugin system, which the C++ version lacks.

There's also a scraping library that someone just released for it.
Google Groups post | GitHub address
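Since the question is about Ruby, here is a rough sketch of how PhantomJS could be driven from a Ruby script. It assumes a `phantomjs` binary is installed and on the PATH, and that the installed version provides the `webpage` and `system` modules and the `page.plainText` property. The names `RENDER_SCRIPT`, `phantomjs_command`, and `rendered_text` are made up for this example.

```ruby
require "open3"
require "tempfile"

# PhantomJS script (JavaScript): open the URL passed as the first argument,
# let the page's own scripts run, then print the rendered text.
# Pages that keep loading content after the load event may need a delay
# before reading page.plainText; that is left out of this sketch.
RENDER_SCRIPT = <<~JS
  var system = require('system');
  var page = require('webpage').create();
  page.open(system.args[1], function (status) {
    if (status !== 'success') {
      phantom.exit(1);
    } else {
      console.log(page.plainText);
      phantom.exit(0);
    }
  });
JS

# Build the command line. Assumption: "phantomjs" is on the PATH.
def phantomjs_command(script_path, url)
  ["phantomjs", script_path, url]
end

# Return the rendered text of the page, or nil if PhantomJS exits with an error.
def rendered_text(url)
  Tempfile.create(["render", ".js"]) do |file|
    file.write(RENDER_SCRIPT)
    file.flush
    stdout, status = Open3.capture2(*phantomjs_command(file.path, url))
    status.success? ? stdout : nil
  end
end
```

Calling `rendered_text("http://example.com/")` would then return the page text after its JavaScript has run, and that text can be processed with ordinary string methods or passed on to Hpricot.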

Note: As others have said though, Google doesn't want people to scrape their search results. I suggest complying with their Terms of Service.


You should have a look at Google's TOS. Scraping their search results is not allowed. Use their search API.


If you scrape Google, you absolutely must use proxies, at least 100 of them. Otherwise they'll easily ban your IP address.
