
Web crawler that can interpret JavaScript [closed]


I want to write a web crawler that can interpret JavaScript. Basically, it's a program in Java or PHP that takes a URL as input and outputs the DOM tree after scripts have run, similar to what the Firebug HTML panel shows. The best example is Kayak.com, where the resulting DOM is not visible in the browser via 'view source', but you can save the resulting HTML through Firebug.

How would I go about doing this? What tools exist that would help me?


Ruby's Capybara is an integration-testing library, but it can also be used to write standalone web crawlers. Since it uses backends like Selenium or headless WebKit, it interprets JavaScript out of the box:

require 'capybara/dsl'
require 'capybara-webkit'

include Capybara::DSL
# Use the headless WebKit driver so JavaScript gets executed
Capybara.current_driver = :webkit
Capybara.app_host = "http://www.google.com"

# Fetch the page and print the DOM as it looks after the scripts have run
page.visit("/")
puts(page.html)


I've been using HtmlUnit (Java). It was originally designed for unit-testing pages. Its JavaScript support isn't perfect, but it hasn't failed me in my (limited) usage. According to the site, it can run the following JS frameworks to a reasonable degree (a minimal usage sketch follows the list):

  • jQuery 1.2.6
  • MochiKit 1.4.1
  • GWT 2.0.0
  • Sarissa 0.9.9.3
  • MooTools 1.2.1
  • Prototype 1.6.0
  • Ext JS 2.2
  • Dojo 1.0.2
  • YUI 2.3.0
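
For what it's worth, here is a minimal sketch of how you might use HtmlUnit for this, assuming a reasonably recent version where WebClient is AutoCloseable; treat it as an outline rather than tested code:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class JsCrawler {
    public static void main(String[] args) throws Exception {
        // WebClient is HtmlUnit's headless "browser": it downloads the page,
        // parses the HTML and executes the scripts it finds.
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = webClient.getPage("http://www.kayak.com/");
            // Give background JavaScript (Ajax calls, timers) up to 10 seconds to finish.
            webClient.waitForBackgroundJavaScript(10000);
            // asXml() serializes the DOM as it stands after the scripts ran,
            // roughly what Firebug's HTML panel shows.
            System.out.println(page.asXml());
        }
    }
}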


You are more likely to have success in Java than in PHP. There is a pre-existing JavaScript interpreter for Java called Rhino. It's a reference implementation and is well documented.

Rhino is used in lots of existing Java apps to provide JavaScript scripting ability within the app. I have also heard of it being used to help run automated tests written in JavaScript.

I also know that Java includes code that can parse and render HTML, though someone who knows more about Java than I do can probably advise further on that. I am not denying it would be very difficult to achieve something like this; you'd essentially be re-implementing a lot of what a browser does.
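
To give a flavour of embedding Rhino, here is a minimal sketch that just evaluates a JavaScript expression from Java. Note that Rhino on its own only gives you the language engine, not a DOM, so for a crawler you would still have to wire it to an HTML parser (which is essentially what HtmlUnit does for you):

import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;

public class RhinoDemo {
    public static void main(String[] args) {
        // Enter a Rhino context for the current thread.
        Context cx = Context.enter();
        try {
            // A scope containing the standard JavaScript objects (Object, Math, ...).
            Scriptable scope = cx.initStandardObjects();
            Object result = cx.evaluateString(
                    scope, "var x = 6 * 7; x;", "<cmd>", 1, null);
            System.out.println(Context.toString(result)); // prints 42
        } finally {
            Context.exit();
        }
    }
}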


You could use Mozilla's rendering engine Gecko:

https://developer.mozilla.org/en/Gecko


Take a look here: http://snippets.scrapy.org/snippets/22/. It's a snippet for Scrapy, a Python screen-scraping and web-crawling framework, used with webdrivers that open a page, render everything you need, and let you "capture" anything you want in the page.
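
That particular snippet is Python, but if you want to stay in Java the same webdriver idea is available through Selenium's Java bindings. A rough sketch, assuming Selenium WebDriver and a local Firefox install:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class WebDriverCrawler {
    public static void main(String[] args) {
        // Launches a real Firefox, so JavaScript runs exactly as it would for a user.
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("http://www.kayak.com/");
            // getPageSource() returns the page markup as currently loaded;
            // for heavily scripted pages you may still need an explicit wait.
            System.out.println(driver.getPageSource());
        } finally {
            driver.quit();
        }
    }
}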

