Help with screen scraping/parsing

I have been attempting to scrape and eventually parse some data (specifically availability and price) from hostels.com, for example http://www.hostels.com/hosteldetails.php/HostelNumber.11890. The problem is that once you select the number of nights and click "book now", nothing is passed through the URL string (it's all done through Ajax, I believe), so I can't go directly to a specific date or time frame.

I have tried browser emulators such as Selenium, IRobotSoft and FakeApp. Although I did get Selenium and Fake to do much of the work of capturing the full source, it was ugly and still tedious when having to scrape (and parse with other software) multiple pages a day.

I have also tried HTML DOM Parser, PHP Scriptable Web Browser, HTMLUnit, cScrape.php and Crowbar. Either they couldn't handle the Ajax or I had no luck getting them to run at all.

Ideally I would like something that can run from a server, with as few dependencies as possible, but at this point I would just like to get it running.

Now, after spending many hours trying to get this working, I still feel I'm not sure where to begin. Can someone just point me in the right direction? Should I go back and spend more time with HTMLUnit? What would be the best practice for a site like this?

Thanks


I'm really into Node.js at the moment (server-side JavaScript, in case you're not familiar), so that's what I'm recommending. What's awesome about using it to scrape sites is that you can use jQuery, or whatever your favorite JS framework is, to do all the work of parsing out the info you want! See the following resources to get started:

http://blog.dtrejo.com/scraping-made-easy-with-jquery-and-selectorga

https://github.com/tmpvar/jsdom

https://github.com/chriso/node.io/wiki/Scraping

https://github.com/joshfire/node-crawler


The page you are referring to does not seem to be using AJAX. What you are calling AJAX is a POST request (as opposed to data passed in the URL, which is a GET request). I suggest you read up on the difference between them. Try to understand what is going on; that is more important than relying on some third-party tool which might turn out to be very inflexible.

Install Firebug and watch which variables are sent in the POST request. Then send the same request from your favourite programming language and parse the HTML of the response for the information you need.
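As a rough illustration of that approach, here is a minimal Python sketch using only the standard library. The form field names (`nights`, `date`), the target URL, and the assumption that the results come back as table rows are all hypothetical; substitute whatever Firebug actually shows for the real request and response.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_booking_request(url, fields):
    """Build a POST request carrying the form fields seen in Firebug."""
    body = urlencode(fields).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows that contained no <td> cells
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_rows(url, fields):
    """Send the POST request and parse the returned HTML (needs network access)."""
    request = build_booking_request(url, fields)
    with urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return extract_rows(response.read().decode(charset, errors="replace"))
```

Calling `fetch_rows(url, {"nights": 3, "date": "2011-06-01"})` with the real URL and field names would return the table cells as lists of strings. If the site also expects cookies or hidden form fields, those would have to be copied from the captured request as well.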

Also, +1 for the effort of trying so many different solutions and not giving up.


I've found Celerity (http://celerity.rubyforge.org), a JRuby library that uses HTMLUnit under the hood, to be a very robust solution for "data acquisition via the Web".

Because Celerity is Ruby, I found it much faster to develop with than full-blown Java (HTMLUnit). Also, because Celerity wraps HTMLUnit, I was able to drop down to HTMLUnit when I needed to do some heavier lifting.

I've had success with sites that are rich in DHTML as well as sites that use Ajax. While I may have used some sleep() calls to wait on the Ajax responses, everything worked as expected.

Give it a try!
