Scrape HTML tables from a given URL into CSV
I seek a tool that can be run on the command line like so:
tablescrape 'http://someURL.foo.com' [n]
If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list. If n is specified, or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV.
Potential additional features:
- To be really fancy, you could parse a table within a table, but for my purposes -- fetching data from Wikipedia pages and the like -- that's overkill.
- An option to asciify any unicode.
- An option to apply an arbitrary regex substitution for fixing weirdnesses in the parsed table.
What would you use to cobble something like this together? The Perl module HTML::TableExtract might be a good place to start and can even handle the case of nested tables. This might also be a pretty short Python script with BeautifulSoup. Would YQL be a good starting point? Or, ideally, have you written something similar and have a pointer to it? (I'm surely not the first person to need this.)
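For what it's worth, the BeautifulSoup route could be sketched roughly like this (assuming beautifulsoup4 is installed; the function names are mine, and nested tables get no special treatment):

```python
import csv
import sys
from urllib.request import urlopen

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def get_tables(page_html):
    """Return each table as a list of rows, each row a list of cell strings."""
    soup = BeautifulSoup(page_html, 'html.parser')
    tables = []
    for table in soup.find_all('table'):
        rows = []
        for tr in table.find_all('tr'):
            cells = [c.get_text(' ', strip=True) for c in tr.find_all(['td', 'th'])]
            if cells:
                rows.append(cells)
        tables.append(rows)
    return tables

def main(url, n=None):
    tables = get_tables(urlopen(url).read())
    if n is None and len(tables) > 1:
        # No table chosen: summarize header row and row count in a numbered list.
        for i, rows in enumerate(tables, 1):
            header = ' | '.join(rows[0]) if rows else ''
            print('%d: [%3d rows] %s' % (i, len(rows), header[:60]))
    else:
        # A table was chosen (or there is only one): dump it to stdout as CSV.
        csv.writer(sys.stdout).writerows(tables[(n or 1) - 1])

if __name__ == '__main__' and len(sys.argv) > 1:
    main(sys.argv[1], int(sys.argv[2]) if len(sys.argv) > 2 else None)
```

Saved as tablescrape, this would follow the calling convention above. Note that find_all('table') returns nested tables separately while their rows also appear in the enclosing table, so the nested case would need more care.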
Related questions:
- How can I scrape an HTML table to CSV?
- How can I convert an HTML table to CSV?
- Options for HTML scraping?
This is my first attempt:
http://yootles.com/outbox/tablescrape.py
It needs a bit more work, like better asciifying, but it's usable. For example, if you point it at this list of Olympic records:
./tablescrape http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics
it tells you that there are 8 tables available and it's clear that the 2nd and 3rd ones (men's and women's records) are the ones you want:
1: [ 1 cols, 1 rows] Contents 1 Men's rec
2: [ 7 cols, 25 rows] Event | Record | Name | Nation | Games | Date | Ref
3: [ 7 cols, 24 rows] Event | Record | Name | Nation | Games | Date | Ref
[...]
Then if you run it again, asking for the 2nd table,
./tablescrape http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics 2
you get a reasonable plaintext data table:
100 metres | 9.69 | Usain Bolt | Jamaica (JAM) | 2008 Beijing | August 16, 2008 | [ 8 ]
200 metres | 19.30 | Usain Bolt | Jamaica (JAM) | 2008 Beijing | August 20, 2008 | [ 8 ]
400 metres | 43.49 | Michael Johnson | United States (USA) | 1996 Atlanta | July 29, 1996 | [ 9 ]
800 metres | 1:42.58 | Vebjørn Rodal | Norway (NOR) | 1996 Atlanta | July 31, 1996 | [ 10 ]
1,500 metres | 3:32.07 | Noah Ngeny | Kenya (KEN) | 2000 Sydney | September 29, 2000 | [ 11 ]
5,000 metres | 12:57.82 | Kenenisa Bekele | Ethiopia (ETH) | 2008 Beijing | August 23, 2008 | [ 12 ]
10,000 metres | 27:01.17 | Kenenisa Bekele | Ethiopia (ETH) | 2008 Beijing | August 17, 2008 | [ 13 ]
Marathon | 2:06:32 | Samuel Wanjiru | Kenya (KEN) | 2008 Beijing | August 24, 2008 | [ 14 ]
[...]
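The "better asciifying" I mentioned could start from something like this (a sketch using the standard unicodedata module, not what tablescrape.py currently does):

```python
import unicodedata

def asciify(text):
    # Decompose accented characters (e.g. "a with tilde" -> "a" + combining
    # tilde), then drop whatever still isn't ASCII. This is lossy: "ø" has
    # no decomposition and simply disappears, so "Vebjørn" becomes "Vebjrn".
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')
```

For example, asciify('São Paulo') gives 'Sao Paulo'. A fancier version would map characters like "ø" to "o" explicitly before falling back to dropping them.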
Using TestPlan, I produced a rough script. Given the complexity of web tables, it will likely need to be tailored for each site.
This first script lists the tables on the page:
# A simple table scraping example. It lists the tables on a page
#
# Cmds.Site = the URL to scan
default %Cmds.Site% http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics
GotoURL %Cmds.Site%
set %Count% 1
foreach %Table% in (response //table)
Notice Table #%Count%
# find a suitable name, look back for a header
set %Check% ./preceding::*[name()='h1' or name()='h2' or name()='h3'][1]
if checkIn %Table% %Check%
Notice (selectIn %Table% %Check%)
end
set %Count% as binOp %Count% + 1
end
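The same listing could be sketched in plain Python with lxml's XPath support (assuming lxml is installed; list_tables is my own name), mirroring the preceding:: axis lookup used for finding a table's nearest heading:

```python
from lxml import html  # third-party: pip install lxml

def list_tables(page_html):
    """Return (index, nearest preceding h1/h2/h3 text) for each table."""
    doc = html.fromstring(page_html)
    out = []
    for i, table in enumerate(doc.xpath('//table'), 1):
        # Look backward in document order for the closest section heading.
        heads = table.xpath('./preceding::*[self::h1 or self::h2 or self::h3][1]')
        out.append((i, heads[0].text_content().strip() if heads else ''))
    return out
```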
The second script then extracts the data of one table into a CSV file.
# Generic extract of contents of a table in a webpage
# Use list_tables to get the list of tables and their indexes
#
# Cmds.Site = the URL to scan
# Cmds.Index = Table index to scan
default %Cmds.Site% http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics
default %Cmds.Index% 2
GotoURL %Cmds.Site%
set %Headers% //table[%Cmds.Index%]/tbody/tr[1]
set %Rows% //table[%Cmds.Index%]/tbody/tr[position()>1]
# Get and clean up the header fields
set %Fields% withvector
end
foreach %Header% in (response %Headers%/*)
putin %Fields% (trim %Header%)
end
Notice %Fields%
# Create an output CSV
call unit.file.CreateDataFile with
%Name% %This:Dir%/extract_table.csv
%Format% csv
%Fields% %Fields%
end
set %DataFile% %Return:Value%
# Now extract each row
foreach %Row% in (response %Rows%)
set %Record% withvector
end
foreach %Cell% in (selectIn %Row% ./td)
putin %Record% (trim %Cell%)
end
call unit.file.WriteDataFile with
%DataFile% %DataFile%
%Record% %Record%
end
end
call unit.file.CloseDataFile with
%DataFile% %DataFile%
end
My CSV file looks like the one below. Note that Wikipedia embeds extra sort-key information in each cell. There are many ways to get rid of it, but none in a generic fashion.
Shot put,22.47 m,"Timmermann, UlfUlf Timmermann",East Germany (GDR),1988 1988 Seoul,"01988-09-23 September 23, 1988",[25]
Discus throw,69.89 m,"Alekna, VirgilijusVirgilijus Alekna",Lithuania (LTU),2004 2004 Athens,"02004-08-23 August 23, 2004",[26]
Hammer throw,84.80 m,"Litvinov, SergeySergey Litvinov",Soviet Union (URS),1988 1988 Seoul,"01988-09-26 September 26, 1988",[27]
Javelin throw,90.57 m,"Thorkildsen, AndreasAndreas Thorkildsen",Norway (NOR),2008 2008 Beijing,"02008-08-23 August 23, 2008",[28]
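The regex-substitution option mentioned in the question is one way to attack this. A sketch (clean_cell is my own name, and the patterns are tuned to exactly the quirks visible above, so this is anything but generic):

```python
import re

def clean_cell(cell):
    # "2008 2008 Beijing" -> "2008 Beijing": drop a leading duplicated token.
    cell = re.sub(r'^(\S+) (?=\1\b)', '', cell)
    # "01988-09-23 September 23, 1988" -> "September 23, 1988":
    # drop the hidden date sort key prepended to the display date.
    cell = re.sub(r'^0\d{4}-\d{2}-\d{2} ', '', cell)
    # "Timmermann, UlfUlf Timmermann" -> "Ulf Timmermann":
    # undo the "Last, FirstFirst Last" sort-key duplication.
    cell = re.sub(r'^(\w+), (\w+)\2 ', r'\2 ', cell)
    return cell
```

Applied per cell before writing the CSV, this would clean up the rows shown above, but any other site (or a Wikipedia layout change) would need its own set of substitutions.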