What is the most elegant way to do screen scraping in node.js?

I'm in the process of hacking together a web app which uses extensive screen scraping in node.js. I feel like I'm fighting against the current at every turn. There must be an easier way to do this. Most notably, two things are irritating:

  1. Cookie propagation. I can pull the 'set-cookie' array out of the response headers, but performing string operations to parse the cookies out of the array feels extremely hackish.

  2. Redirect following. I want each request to automatically follow redirects when a 302 status code is returned. (A sketch of the kind of manual handling I'm trying to get away from follows this list.)
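
To make the above concrete, here is roughly the kind of code I end up writing against the bare https module; the URL is just a placeholder and the cookie parsing is deliberately naive:

    var https = require('https');
    var url = require('url');

    // Naive GET that re-sends cookies and follows 302s by hand.
    // 'cookies' maps name -> value; everything here is deliberately simplistic.
    function get(targetUrl, cookies, callback) {
      var parsed = url.parse(targetUrl);
      var options = {
        host: parsed.host,
        path: parsed.path,
        headers: {
          Cookie: Object.keys(cookies).map(function (name) {
            return name + '=' + cookies[name];
          }).join('; ')
        }
      };

      https.get(options, function (res) {
        // 1. Cookie propagation: string-slice each Set-Cookie header.
        (res.headers['set-cookie'] || []).forEach(function (raw) {
          var pair = raw.split(';')[0].split('=');
          cookies[pair[0]] = pair[1];
        });

        // 2. Redirect following: recurse when a 302 comes back with a Location.
        if (res.statusCode === 302 && res.headers.location) {
          return get(url.resolve(targetUrl, res.headers.location), cookies, callback);
        }

        var body = '';
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () { callback(null, res, body); });
      }).on('error', callback);
    }

    get('https://example.com/login', {}, function (err, res, body) {
      if (err) return console.error(err);
      console.log(res.statusCode, body.length);
    });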

I came across two things which looked useful, but I couldn't use in the end:

  • http://zombie.labnotes.org/, but it doesn't have HTTPS support, so I can't use it.

  • http://www.phantomjs.org/, but I couldn't use it because it doesn't (appear to) integrate with node.js. It's also pretty heavyweight for what I'm doing.

Are there any JavaScript screen-scraping libraries which propagate cookies, follow redirects, and support HTTPS? Any pointers on how to make this easier?


I actually have a scraper library now: https://github.com/mikeal/spider. It's quite nice; you can use jQuery and routes.

Feedback is welcome :)


You may want to check out https://github.com/mikeal/request from mikeal. I just spoke to him in the chatroom and he says that it does not handle cookies at the moment, but you can write a submodule to handle them for you in the meantime.

As for redirects, it handles them beautifully :)
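
As a rough sketch of what that could look like: request follows redirects on its own (via the followRedirect and maxRedirects options), and the hand-rolled cookie pass-through below is the kind of stopgap meant above. The URLs and the naive cookie joining are purely illustrative, and cookies set during intermediate redirects are lost with this approach:

    var request = require('request');

    var cookies = [];

    // First request: let request follow any 302s, then remember the Set-Cookie headers.
    request({ url: 'https://example.com/login', followRedirect: true, maxRedirects: 5 },
      function (err, res, body) {
        if (err) return console.error(err);

        // Keep only the "name=value" part of each cookie; attributes are dropped.
        (res.headers['set-cookie'] || []).forEach(function (raw) {
          cookies.push(raw.split(';')[0]);
        });

        // Second request: send the collected cookies back by hand.
        request({
          url: 'https://example.com/account',
          headers: { Cookie: cookies.join('; ') }
        }, function (err2, res2) {
          if (err2) return console.error(err2);
          console.log(res2.statusCode);
        });
      });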


It turns out someone made a PhantomJS module for node.js:

https://github.com/sgentle/phantomjs-node

While PhantomJS is fairly heavy, it also supports SSL, cookies, and everything else a typical browser supports (since it is a WebKit browser, after all).

Give it a shot, it may be exactly what you are looking for.
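
For illustration, a minimal sketch assuming the callback-style API described in the phantomjs-node README (create, createPage, page.open, page.evaluate); the URL is a placeholder and the exact signatures are worth checking against the README:

    var phantom = require('phantom');

    // Spin up a PhantomJS process, load a page, and pull its title out of the DOM.
    phantom.create(function (ph) {
      ph.createPage(function (page) {
        page.open('https://example.com/', function (status) {
          if (status !== 'success') {
            console.error('failed to load page');
            return ph.exit();
          }
          // evaluate() runs the first function inside the page and hands
          // its return value to the second, node-side callback.
          page.evaluate(function () {
            return document.title;
          }, function (title) {
            console.log('Title:', title);
            ph.exit();
          });
        });
      });
    });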
