Scrape a web page for certain data

We are creating a script.

Essentially, the user enters a number such as 3358928 into a form field. On submit, an Ajax request visits the page below, with the numeric string the user entered appended to the URL.

http://www.fairtrading.qld.gov.au/ftlr/Default.aspx?ResultType=LNum&LNum=3358928&LType=REAL%20ESTATE&Page=1

That page contains a first name and a surname. How would we scrape the first name and surname and echo them back to our form?

The difficulty is in scraping the page itself.

Any help appreciated.


First, your web server must be set up to proxy all of the client's requests. Otherwise, the third-party server would have to send an Access-Control-Allow-Origin header and the visitor's browser would have to support cross-domain XMLHttpRequest. (Flash/Silverlight similarly requires a crossdomain.xml file.)
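Whichever proxy you use, it has to build the third-party URL from the number the user submitted. A minimal sketch of that step follows, assuming the query parameters shown in the question's URL are the only ones required; `buildLookupUrl` and `BASE_URL` are hypothetical names, not part of any existing proxy script.

```javascript
// Hypothetical helper: builds the lookup URL that the proxy would fetch.
// The parameter names and values are copied from the URL in the question.
const BASE_URL = "http://www.fairtrading.qld.gov.au/ftlr/Default.aspx";

function buildLookupUrl(licenceNumber) {
  const trimmed = String(licenceNumber).trim();
  // Accept digits only, so user input cannot inject extra query parameters.
  if (!/^\d+$/.test(trimmed)) {
    throw new Error("Licence number must contain digits only");
  }
  const params = [
    ["ResultType", "LNum"],
    ["LNum", trimmed],
    ["LType", "REAL ESTATE"],
    ["Page", "1"],
  ];
  // encodeURIComponent turns the space in "REAL ESTATE" into %20.
  const query = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `${BASE_URL}?${query}`;
}
```

Rejecting anything that is not purely numeric before making the request keeps the proxy from being used to fetch arbitrary URLs.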

This is exactly the way http://ajax-cross-domain.com/ works. (That particular proxy script just happens to JavaScript-encode the third-party server's response.)

I noticed that the page includes an XHTML doctype, which suggested you could use the responseXML property of native XMLHttpRequest or jQuery (as opposed to AJAX Cross Domain) to take advantage of the browser's XML parser. Unfortunately, this is just another web site that outputs invalid XML: it does not encode ampersands correctly as &amp;amp;.

Thus you will most likely have to resort to regular expressions, which is not ideal. Probably the simplest approach is to find the text of the td elements (relying on the fact that the exact same tag is not nested):

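// Create the regexp object. The g flag makes each exec() call resume after the previous match.
var regex = /<td class="BodyFont">(.*?)<\/td>/g;

// Run these two lines as many times as needed, once per cell.
// exec() returns null once no matching cells remain, so guard before indexing.
var match = regex.exec(textOfThePage);
var contentsOfNextTd = match ? match[1] : null;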
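If you want every cell at once rather than one call per cell, the same pattern can be run in a loop. This is only a sketch: `extractCells` is a hypothetical helper, and it assumes the cells of interest all carry `class="BodyFont"` exactly as in the pattern above.

```javascript
// Hypothetical helper: collects the text of every matching td in the page source.
function extractCells(html) {
  // A fresh regex per call avoids carrying lastIndex state between pages.
  // [\s\S] is used instead of . so that a cell spanning several lines still matches.
  const regex = /<td class="BodyFont">([\s\S]*?)<\/td>/g;
  const cells = [];
  let match;
  // exec() returns null once there are no more matches, which ends the loop.
  while ((match = regex.exec(html)) !== null) {
    cells.push(match[1].trim());
  }
  return cells;
}
```

Which positions in the returned array hold the first name and the surname depends on the page layout, so inspect the result for a known licence number before relying on fixed indexes.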

The regex approach is somewhat ugly; it would be much simpler if only we had valid XML to work with. If you have the option, I would recommend scraping the page on your own server and returning a nicely formatted JSON or XML response. You need a server-side proxy anyway, and it will keep the client-side code simpler.
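As a rough illustration of the JSON response, assuming the server runs JavaScript (for example Node.js) and has already pulled the cell texts out of the page into an array: `toNamePayload`, the `found` flag, and the index parameters are all hypothetical, and the positions of the two names have to be determined by inspecting the real page.

```javascript
// Hypothetical helper: turns scraped cell texts into the JSON body sent to the form.
// firstNameIndex and surnameIndex are assumptions about the page layout.
function toNamePayload(cells, firstNameIndex, surnameIndex) {
  const firstName = cells[firstNameIndex];
  const surname = cells[surnameIndex];
  // If either cell is missing, report that nothing was found instead of failing.
  if (firstName === undefined || surname === undefined) {
    return JSON.stringify({ found: false });
  }
  return JSON.stringify({ found: true, firstName, surname });
}
```

The client-side code then only has to parse the response and fill in the form fields, with no regular expressions in the browser.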
