
Create CSV from HTML pages

There is a website that displays a lot of data in HTML tables. They have paged the data, so there are around 500 pages.

What is the most convenient (easy) way of getting the data from those tables and downloading it as a CSV, on Windows?

Basically I need to write a script that does something like this, but it is overkill to write it in C#, and I am looking for other solutions that people with web experience use:

for i = 1 to 500:
    load page from http://x/page_i.html
    parse the source and get the data in the table with id='data'
    append the results to a CSV
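
The loop above can be sketched with just the Python standard library; no third-party parser is strictly required. The URL pattern http://x/page_i.html and the table id 'data' come from the question, while the output file name and the parser class are assumptions for illustration:

```python
# Minimal sketch of the question's loop, standard library only.
# Assumptions: pages really live at http://x/page_<i>.html, the target
# table has id='data', and output goes to data.csv in the current folder.
import csv
import urllib.request
from html.parser import HTMLParser

class DataTableParser(HTMLParser):
    """Collect cell text from the <table id='data'> element on one page."""
    def __init__(self):
        super().__init__()
        self.in_table = False
        self.in_cell = False
        self.row = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == "data":
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.row = []
        elif self.in_table and tag in ("td", "th"):
            self.in_cell = True
            self.row.append("")   # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif self.in_table and tag == "tr" and self.row:
            self.rows.append(self.row)
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.row[-1] += data.strip()

def scrape(out_path="data.csv", pages=500):
    """Fetch every page, parse its data table, append all rows to one CSV."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for i in range(1, pages + 1):
            html = urllib.request.urlopen(
                "http://x/page_%d.html" % i).read().decode("utf-8")
            parser = DataTableParser()
            parser.feed(html)
            writer.writerows(parser.rows)
```

This deliberately ignores edge cases such as nested tables or colspans; it is a starting point, not a robust scraper.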

Thanks!


I was doing a screen-scraping application once and found BeautifulSoup to be very useful. You could easily plop that into a Python script and parse across all the tags with the specific id you're looking for.
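A minimal sketch of that approach, assuming BeautifulSoup (and, for fetching, any HTTP client) is installed; the table id 'data' comes from the question, and the function name is made up here:

```python
# Sketch: pull the rows out of the table with id='data' using BeautifulSoup.
# Requires: pip install beautifulsoup4
from bs4 import BeautifulSoup

def table_rows(html):
    """Return the rows of the table with id='data' as lists of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="data")
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
        if cells:
            rows.append(cells)
    return rows
```

Run that over each downloaded page and feed the rows to Python's csv.writer to build the final file.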


The easiest non-C# way I can think of is to use Wget to download the pages, then run HTML Tidy to convert them to XML/XHTML, and then transform the resulting XML to CSV with an XSLT stylesheet (run with MSXSL.exe).

You will have to write some simple batch files and an XSLT with a basic XPath selector.

If you feel it would be easier to just do it in C#, you can use SgmlReader to read the HTML DOM and run an XPath query to extract the data. It should not take more than about 20 lines of code.
