
A web spider: any methods or ideas for crawling dynamic web pages?

There are many web spiders, but they only fetch the raw HTML from the internet. I want a web spider, or some method or idea, that can crawl dynamic web pages: it should execute JavaScript and let me get information from the DOM tree.


The more you want your spider to behave like a real browser, the more you will need a real browser. I therefore recommend starting with a headless browser such as Crowbar. From its description:

[Crowbar's] purpose is to allow running javascript scrapers against a DOM to automate web sites scraping but avoiding all the syntax normalization issues.


If you are familiar with Java, you can try HttpUnit (http://httpunit.sourceforge.net/). HttpUnit is intuitive and easy to use. It was made for unit testing web applications, but it can also be a powerful tool for web crawling. It has an integrated JavaScript engine and comes bundled with many useful libraries.
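To illustrate the "get information from the DOM tree" part, here is a minimal sketch that uses only the JDK. It assumes you already have an org.w3c.dom.Document, for example the one HttpUnit returns from WebResponse.getDOM() after fetching a page. The class name DomLinkExtractor and the parse helper are hypothetical. The helper only stands in for the browser step by parsing a well-formed markup string, and it does not execute any JavaScript.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of HttpUnit or Crowbar.
public class DomLinkExtractor {

    // Collects the href attribute of every <a> element found in the DOM tree.
    public static List<String> extractLinks(Document dom) {
        List<String> links = new ArrayList<>();
        NodeList anchors = dom.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                links.add(anchor.getAttribute("href"));
            }
        }
        return links;
    }

    // Stand-in for the crawler step: parses well-formed markup into a DOM.
    // A real spider would get the Document from its browser or HTTP library instead.
    public static Document parse(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<a href=\"/a\">A</a>"
                + "<a name=\"x\">no href</a>"
                + "<a href=\"/b\">B</a>"
                + "</body></html>";
        System.out.println(extractLinks(parse(page)));
    }
}
```

Depending on the HTML parser behind the Document, element names may be reported in upper case, so the tag name passed to getElementsByTagName might need adjusting.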
