Data extraction from source with lots of white space

I'm trying to extract data from http://www.phillysheriff.com/old_site/properties.html

Ideally I'd be able to get a CSV file with the address, ward, price, and square feet. Is there an easy way to do this?


The process of extracting information like this from web pages is known colloquially as "scraping". If it were me, I'd use Python and the Beautiful Soup package. Otherwise, searching for "screen scrape" or "web scrape" plus your favourite programming language should turn up a package that does the hard work for you.
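As a rough sketch of the idea, here is one way it might look using only Python's standard library (regular expressions standing in for Beautiful Soup, so nothing needs installing). The sample markup, the field labels ("Ward", "Sq Ft", "$"), and the helper names are all assumptions made for illustration; the real page's layout would need to be inspected and the patterns adjusted to match.

```python
import csv
import html
import io
import re

# Hypothetical sample: the real page's markup is not shown here, so this
# layout (records separated by <hr>, address first) is an assumption.
SAMPLE_HTML = """
<center><b>1234   N. Example St.</b></center>
19th Ward &nbsp;&nbsp;   1,200 Sq Ft
<br>   Judgment: $45,000.00
<hr>
<center><b>56 Sample Ave.</b></center>
7th   Ward&nbsp;&nbsp;980 Sq Ft
<br>Judgment:   $12,500.50
<hr>
"""

# Assumed field patterns; each captures the value in group 1.
PATTERNS = {
    "ward": re.compile(r"(\d+)(?:st|nd|rd|th)\s+Ward", re.I),
    "price": re.compile(r"\$\s*([\d,]+(?:\.\d+)?)"),
    "square_feet": re.compile(r"([\d,]+)\s*Sq\.?\s*Ft", re.I),
}


def clean_lines(chunk):
    """Strip tags, decode entities, and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", "\n", chunk)   # every tag becomes a line break
    text = html.unescape(text)               # &nbsp; becomes a no-break space
    lines = (" ".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]


def parse_record(chunk):
    """Turn one <hr>-delimited chunk into a dict, or None if it is empty."""
    lines = clean_lines(chunk)
    if not lines:
        return None
    text = " ".join(lines)
    row = {"address": lines[0]}  # assumption: the address is the first text line
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1).replace(",", "") if match else ""
    return row


def html_to_csv(page):
    """Split the page on <hr> tags and write one CSV row per record."""
    chunks = re.split(r"<hr[^>]*>", page, flags=re.I)
    rows = [row for row in map(parse_record, chunks) if row]
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["address", "ward", "price", "square_feet"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    print(html_to_csv(SAMPLE_HTML))
```

For the real page, the HTML could be fetched with urllib.request.urlopen and the decoded text passed to html_to_csv. The whitespace handling is the main point: replacing tags with line breaks and then collapsing each line with split/join copes with stray spaces, tabs, and &nbsp; entities in one step.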


You can run the IRobotSoft web scraper, open the page in its browser window, and choose Design -> Practice HTQL from the menu. Enter the following HTQL query in the input box to transform the page into a standard HTML table:

<hr sep>2-0{
a=<center>1 &tx &trim;
b=<center>1:xx ./'nbsp'/1 &tx &trim('&; ');
c=<center>1:xx ./'nbsp'/3 ./'\n'/1 &tx &trim('&; ');
d=<center>1:xx ./'nbsp'/3 ./'Ward'~'BRT#'/1 &tx;
e=<center>1:xx ./'nbsp'/3 ./'BRT#'~'Improvements:'/1 &tx;
f=<center>1:xx ./'nbsp'/3 ./'Improvements:'/2 &tx;
g=<br sep>2. /'nbsp'/1 &tx &trim('&; ');
h=<br sep>2. /'nbsp'/3 &tx &trim('&; '); 
i=<br sep>2. /'nbsp'/5 &tx &trim('&; ');
j=<br sep>2. /'nbsp'/7 &tx &trim('&; ');
}