Data extraction from source with lots of white space
I'm trying to extract data from http://www.phillysheriff.com/old_site/properties.html
Ideally I'd be able to get a CSV file with the address, ward, price, and square feet. Is there an easy way to do this?
Extracting information like this from web pages is known colloquially as "scraping". If it were me, I'd use Python and the "Beautiful Soup" package to do it. However, a Google search for "screen scrape" or "web scrape" plus your favourite programming language should turn up a package that will do the hard work for you.
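To give a rough idea of what such a script might look like, here is a minimal sketch that uses only the Python standard library (`re`, `html`, `csv`) as a dependency-free stand-in for Beautiful Soup. The layout in `SAMPLE` is an assumption, not the real page: it supposes that listings are separated by `<hr>` tags and padded with `&nbsp;`, and that the ward, square footage, and price appear as "5th Ward", "1,200 Sq. Ft.", and "$8,500.00". The names `clean`, `parse_listings`, and `to_csv` are made up for illustration, and the regular expressions would need adjusting after looking at the actual HTML.

```python
import csv
import html
import io
import re

# Hypothetical sample, NOT copied from the real page: records separated by
# <hr>, fields padded with &nbsp; and line breaks.
SAMPLE = """
<hr>
<center><b>123 &nbsp; MAIN ST &nbsp;&nbsp; 19103</b><br>
5th Ward &nbsp; 1,200 Sq. Ft. &nbsp; BRT# 012345678</center>
Opening bid: &nbsp; $8,500.00
<hr>
<center><b>4567 &nbsp; N BROAD ST &nbsp; 19140</b><br>
43rd Ward &nbsp; 2,050 Sq. Ft. &nbsp; BRT# 876543210</center>
Opening bid: &nbsp; $12,300.00
"""

# Assumed field formats; adjust these to match the real listings.
WARD = re.compile(r"(\d+)(?:st|nd|rd|th)\s+Ward", re.I)
SQFT = re.compile(r"([\d,]+(?:\.\d+)?)\s*Sq\.?\s*Ft", re.I)
PRICE = re.compile(r"\$\s*([\d,]+(?:\.\d{2})?)")


def clean(fragment):
    """Strip tags, decode entities, and collapse all runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", fragment)
    text = html.unescape(text)  # turns &nbsp; into U+00A0, which \s matches
    return re.sub(r"\s+", " ", text).strip()


def parse_listings(page):
    """Return one dict per <hr>-separated record that mentions a ward."""
    rows = []
    for chunk in re.split(r"<hr\b[^>]*>", page, flags=re.I):
        text = clean(chunk)
        ward = WARD.search(text)
        if not ward:
            continue  # skip page headers, footers, and empty chunks
        sqft = SQFT.search(text)
        price = PRICE.search(text)
        rows.append({
            # Assumption: the address is everything before the ward.
            "address": text[:ward.start()].strip(),
            "ward": ward.group(1),
            "price": price.group(1).replace(",", "") if price else "",
            "sqft": sqft.group(1).replace(",", "") if sqft else "",
        })
    return rows


def to_csv(rows):
    """Serialize the parsed rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["address", "ward", "price", "sqft"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(parse_listings(SAMPLE)))
```

To run it against the live page you would replace `SAMPLE` with the downloaded HTML, for example via `urllib.request.urlopen(url).read().decode(errors="replace")`. Collapsing whitespace before matching is what deals with the "lots of white space" problem: once tags and `&nbsp;` padding are normalized to single spaces, simple patterns are enough to pick out the fields.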
You can run the IRobotSoft web scraper, open the page in its browser window, and choose Design -> Practice HTQL from the menu. Enter the following HTQL query in the input box to transform the page into a standard HTML table:
<hr sep>2-0{
a=<center>1 &tx &trim;
b=<center>1:xx ./'nbsp'/1 &tx &trim('&; ');
c=<center>1:xx ./'nbsp'/3 ./'\n'/1 &tx &trim('&; ');
d=<center>1:xx ./'nbsp'/3 ./'Ward'~'BRT#'/1 &tx;
e=<center>1:xx ./'nbsp'/3 ./'BRT#'~'Improvements:'/1 &tx;
f=<center>1:xx ./'nbsp'/3 ./'Improvements:'/2 &tx;
g=<br sep>2. /'nbsp'/1 &tx &trim('&; ');
h=<br sep>2. /'nbsp'/3 &tx &trim('&; ');
i=<br sep>2. /'nbsp'/5 &tx &trim('&; ');
j=<br sep>2. /'nbsp'/7 &tx &trim('&; ');
}