Ruby, Scrape page written entirely in JavaScript

2023-03-18 01:54 Asked by:

I am playing with Ruby + Hpricot and building a simple scraper. It works on other sites with no issues. But if a page is written entirely in JavaScript, can it be scraped? ~~But Google search results pages now seem to be entirely JavaScript-based, except for a few internal links.~~

Can pages written like this not be scraped by regular tools like Mechanize and Hpricot? (My guess is that they can't.)
Are there tools/gems available that try to render the page (like a browser) and then collect the data?

Thanks!

Edit: Thanks for your responses. I realize that scraping Google directly is not right; there is an API in place that can be used instead. At its core, what I really wanted to find out was this: if a page is written entirely in JavaScript (including its text and contents, which could be obfuscated), is there a gem that will try to render the page as text only and then extract its text contents?

This is very, very important, so listen carefully:

Always check 'robots.txt' first, and don't scrape if it tells you not to!

If you look at http://www.google.com/robots.txt, you will clearly see this line:

Disallow: /search
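As a rough illustration of the check described above, here is a minimal Ruby sketch that reads robots.txt text and reports whether a path is disallowed. The helper names `disallowed_prefixes` and `allowed?` are made up for this example, and the parser is deliberately simplified: it only handles `User-agent` and `Disallow` lines with plain prefix matching, and ignores `Allow` rules and wildcards, so treat it as a starting point rather than a complete implementation of the robots.txt rules.

```ruby
# Collect the Disallow prefixes that apply to a given user agent.
# Simplified on purpose: no Allow rules, no wildcard patterns.
def disallowed_prefixes(robots_txt, user_agent = "*")
  groups = Hash.new { |hash, key| hash[key] = [] }
  current_agents = []
  last_was_agent = false

  robots_txt.each_line do |line|
    line = line.sub(/#.*/, "").strip        # drop comments and whitespace
    next if line.empty?

    field, value = line.split(":", 2).map(&:strip)
    next if value.nil?                      # not a "field: value" line

    case field.downcase
    when "user-agent"
      # Consecutive User-agent lines share one group of rules.
      current_agents = [] unless last_was_agent
      current_agents << value.downcase
      last_was_agent = true
    when "disallow"
      # An empty Disallow value means "nothing is disallowed".
      current_agents.each { |agent| groups[agent] << value } unless value.empty?
      last_was_agent = false
    else
      last_was_agent = false
    end
  end

  # Prefer rules for the specific agent, fall back to the "*" group.
  groups.fetch(user_agent.downcase) { groups.fetch("*", []) }
end

# True when no Disallow prefix matches the start of the path.
def allowed?(robots_txt, path, user_agent = "*")
  disallowed_prefixes(robots_txt, user_agent).none? { |prefix| path.start_with?(prefix) }
end

sample = <<~ROBOTS
  User-agent: *
  Disallow: /search
ROBOTS

puts allowed?(sample, "/search?q=ruby")  # prints false
puts allowed?(sample, "/about")          # prints true
```

To run this against a real site, the robots.txt text could be fetched with Ruby's standard library, for example `Net::HTTP.get(URI("http://www.google.com/robots.txt"))` after `require "net/http"`, and passed in as the first argument.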

Edit (based on asker's comments)

Setting aside the 'robots.txt' issue for a moment, you are probably much better off learning on a simpler website anyway. I'd suggest using a website or two that don't change often, so you can easily reproduce your results and verify that everything is working as you expect.

Here's a link for you that turns off instant loading.
http://www.google.com/webhp?hl=en&tab=ww&complete=0

Are there tools/gems available that try to render the page (like a browser) and then collect the data?

You can use PhantomJS (C++) or PyPhantomJS (Python) for screen scraping if you want.

PyPhantomJS also has a really nice plugin system which the C++ one doesn't.

There's also a scraping library that someone just released for it.
Google Groups post | GitHub address

Note: As others have said though, Google doesn't want people to scrape their search results. I suggest complying with their Terms of Service.

You should have a look at Google's TOS. Scraping their search results is not allowed. Use their search API.

If you scrape Google, you absolutely must use proxies, 100 or more. Otherwise they'll easily ban your IP address.
