Help with screen scraping/parsing

2023-03-08 05:51

I have been trying to scrape, and eventually parse, some data (specifically availability and price) from hostels.com, for example http://www.hostels.com/hosteldetails.php/HostelNumber.11890. The problem is that once you select the number of nights and click "book now", nothing is passed through the URL string (it's all done through Ajax, I believe), so I can't go directly to a specific date or time frame.

I have tried browser emulators such as Selenium, IRobotSoft and FakeApp. Although I did get Selenium and Fake to do much of the work of capturing the full source, it was ugly and still tedious when having to scrape (and parse with other software) multiple pages a day.

I have also tried HTML DOM Parser, PHP Scriptable Web Browser, HTMLUnit, cScrape.php and Crowbar. Either they couldn't handle the Ajax or I had no luck even getting them to run.

Ideally I would like something that can run from a server, with as few dependencies as possible, but at this point I would just like to get it running.

Now, after spending many hours trying to get this working, I still feel I'm not sure where to begin. Can someone just point me in the right direction? Should I go back and spend more time with HTMLUnit? What would be the best practice for a site like this?

Thanks

I'm really into Node.js atm (server-side javascript, in case you're not familiar), so that's what I'm recommending. What's awesome about using it to scrape sites is you can use jQuery or whatever your favorite JS framework is to do all the work of parsing for the info you want! See the following resources to get started:

http://blog.dtrejo.com/scraping-made-easy-with-jquery-and-selectorga

https://github.com/tmpvar/jsdom

https://github.com/chriso/node.io/wiki/Scraping

https://github.com/joshfire/node-crawler

The page you are referring to does not seem to be using AJAX. Instead, what you are calling AJAX is a POST request (as opposed to data passed in the URL, which is a GET request). I suggest you read up on the difference between them. Try to understand what is going on; that is more important than relying on some third-party tool which might turn out to be very inflexible.

Install Firebug and watch which variables are sent in the POST request. Now do the same thing in your favourite programming language: send the same POST request, then parse the HTML it returns for the information you need.
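As a rough illustration of replaying a POST request and parsing what comes back, here is a minimal Python sketch using only the standard library. The form field names (`date`, `nights`) and the `price` cell class are placeholders invented for this example, not the site's real ones; the actual names have to be read from the request and response you see in Firebug.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_booking_request(url, form_fields):
    """Build a POST request carrying form fields as a URL-encoded body."""
    body = urlencode(form_fields).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


class CellTextParser(HTMLParser):
    """Collect the text of every <td> whose class attribute matches exactly."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.cells = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == self.css_class:
            self._inside = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._inside = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self._inside:
            self.cells[-1] += data


def extract_cells(html, css_class):
    """Return the stripped text of all matching table cells in the HTML."""
    parser = CellTextParser(css_class)
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


def fetch_prices(url, form_fields):
    """Send the POST and pull out the (hypothetical) 'price' cells."""
    with urlopen(build_booking_request(url, form_fields), timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    return extract_cells(html, "price")
```

Something like `fetch_prices(page_url, {"date": "2011-03-01", "nights": "2"})` would then perform the real request, once the placeholder field names and the cell class have been swapped for the ones the site actually uses.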

Also, +1 for the effort of trying so many different solutions and not giving up.

I've found Celerity (http://celerity.rubyforge.org), a JRuby library that uses HTMLUnit under the hood, to be a very robust solution for "data acquisition via the Web".

Because Celerity is Ruby, I found it much faster to develop with than full-blown Java (HTMLUnit). Also, because Celerity wraps HTMLUnit, I was able to drop down to HTMLUnit when I needed to do some heavier lifting.

I've had success with sites that are rich in DHTML as well as sites that use Ajax, and while I may have used some sleep() calls to wait on the Ajax responses, everything worked as expected.
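The sleep() approach can be made a little less fragile by polling for a condition with a timeout instead of pausing for a fixed time. The sketch below is in Python rather than Ruby and is not part of Celerity or HTMLUnit; `wait_until` is a hypothetical helper written for this example, and the condition you pass in would be whatever check tells you the Ajax content has arrived.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll condition() until it returns a truthy value or the timeout passes.

    Returns the truthy value, or raises TimeoutError if it never appears.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

The condition is always checked at least once, and the helper returns as soon as it holds, so a fast response does not pay for the full wait the way a fixed sleep() does.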

Give it a try!
