Importing tables in Mathematica from web - empty cell problem
I use:
data=Import["http://weburl/","Data"]
to import data from one site. On that page there are tables. This creates nested lists, and you can easily get the data in table form. For example:
Grid[data[[1]]]
would give something like this:
Player Age Shots Goals
P1 24 10 2
P2 22 5 0
P3 28 11 1
...
Now, here is the problem. If one cell in the HTML table is empty, for example an entry for "Age", it appears in the HTML as <td></td>. Mathematica doesn't include it in the list at all, not even as, for example, a "Null" value. Instead, the row is represented by a list of length 3 and the data shifts left by one column, so you get "Shots" in place of "Age", "Goals" in place of "Shots", and "Goals" is empty.
For example, a "P4" whose age is unknown (empty cell in the HTML table), who had 10 shots and scored 0 goals, would be imported as a list of length 3, not 4, with the values shifted by one column:
Player Age Shots Goals
P1 24 10 2
P2 22 5 0
P3 28 11 1
P4 10 0
...
This poses a difficult problem, because if a row has a few empty fields you can't tell from the list which column each value belongs to. Is there a way to put a "Null" in place of an empty cell when importing HTML tables in Mathematica? For example, the P4 element in the list would look like this:
data[[1,5]]
{"P4","Null",10,0}
instead of:
{"P4",10,0}
As lumeng points out, you can use FullData to get the HTML table element to fill out properly. Here's a simpler illustration of this.
in = ImportString["\<<html><table>
<tr>
<td>(1,1)</td>
<td>(1,2)</td>
<td>(1,3)</td>
</tr>
<tr>
<td>(2,1)</td>
<td></td>
<td>(2,3)</td>
</tr>
</table></html>\>",
{"HTML", "FullData"}];
Grid[in[[1, 1]]]
If you want more complete control of the output, I'd suggest that you Import the page as XML. Here's an example.
in = ImportString["\<<html><table>
<tr>
<td>(1,1)</td>
<td>(1,2)</td>
<td>(1,3)</td>
</tr>
<tr>
<td>(2,1)</td>
<td></td>
<td>(2,3)</td>
</tr>
</table></html>\>", "XML"];
Column[Last /@ Cases[in,
XMLElement["td", ___], Infinity]]
You'll need to read up a bit on XML in general and on Mathematica's representation of it, namely the XMLObject. It's a delight to work with once you get the hang of it, though.
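As a rough sketch of how that control might look (the helper name cellValue and the shortened two-column table are my own, for illustration only), you can collect the td elements row by row and map an empty cell to Null. This assumes each cell holds plain text and that the page parses as well-formed XML:

```mathematica
(* Small sample table; the last cell of the second row is empty *)
xml = ImportString[
   "<html><table><tr><td>(1,1)</td><td>(1,2)</td></tr><tr><td>(2,1)</td><td></td></tr></table></html>",
   "XML"];

(* Hypothetical helper: turn one td element into a value *)
cellValue[XMLElement["td", _, {}]] := Null           (* empty cell *)
cellValue[XMLElement["td", _, {s_String}]] := s      (* plain text cell *)
cellValue[XMLElement["td", _, other_]] := other      (* nested markup is left untouched *)

(* For every tr, take its direct td children in order *)
rows = Cases[xml,
   XMLElement["tr", _, cells_] :> (cellValue /@ Cases[cells, XMLElement["td", _, _]]),
   Infinity];

Grid[rows]
```

Because the cells are gathered per tr, every row keeps its full length and the columns stay aligned.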
In[13]:= htmlcode = "<html><table border=\"1\">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
<td>row 1, cell 3</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td></td>
<td>row 2, cell 3</td>
</tr>
</table><html>";
In[14]:= file = ToFileName[{$TemporaryDirectory}, "tmp.html"]
Out[14]= "/tmp/tmp.html"
In[15]:= OpenWrite[file]
WriteString[file,htmlcode]
Close[file]
FilePrint[file]
Out[15]= OutputStream[/tmp/tmp.html,18]
Out[17]= /tmp/tmp.html
During evaluation of In[15]:=
<html><table border="1">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
<td>row 1, cell 3</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td></td>
<td>row 2, cell 3</td>
</tr>
</table><html>
In[23]:= Import[file,"Elements"]//InputForm
Out[23]//InputForm=
{"Data", "FullData", "Hyperlinks", "ImageLinks", "Images", "Plaintext", "Source", "Title", "XMLObject"}
In[22]:= Import[file,"FullData"]//InputForm
Out[22]//InputForm=
{{{{"row 1, cell 1", "row 1, cell 2", "row 1, cell 3"}, {"row 2, cell 1", "", "row 2, cell 3"}}}, {}}
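Since the empty cell comes back as an empty string, a replacement rule should be enough to turn it into Null if that is what you need. A minimal sketch, using the output above pasted in as a literal list (the names full and cleaned are just for illustration):

```mathematica
(* The "FullData" result shown above, copied in as a literal *)
full = {{{{"row 1, cell 1", "row 1, cell 2", "row 1, cell 3"},
     {"row 2, cell 1", "", "row 2, cell 3"}}}, {}};

(* Replace every empty string with Null; all other entries are unchanged *)
cleaned = full /. "" -> Null;

cleaned // InputForm
```

Note that this also replaces cells that were genuinely meant to hold an empty string, which for an HTML table is usually the same thing.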
Using Computist's sample, you could also do:
htmlcode = "<html><table border=\"1\">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
<td>row 1, cell 3</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td></td>
<td>row 2, cell 3</td>
</tr>
</table></html>";
StringReplace[htmlcode, "<td></td>" -> "<td>###</td>"];
ImportString[%, "Data"] /. "###" -> Null
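The literal match on "<td></td>" will miss empty cells that carry attributes or contain only whitespace. A possible variation, assuming the marker ### never occurs in the real data, uses a regular expression instead (the sample string and the name patched are made up for illustration):

```mathematica
(* Sample with a plain empty cell and a whitespace-only cell that has an attribute *)
sample = "<html><table><tr><td>a</td><td class=\"x\"> </td><td></td></tr></table></html>";

(* Put a marker into every td that is empty or whitespace-only, keeping its attributes *)
patched = StringReplace[sample,
   RegularExpression["<td([^>]*)>\\s*</td>"] -> "<td$1>###</td>"];

(* Import the patched HTML and turn the marker into Null *)
ImportString[patched, {"HTML", "Data"}] /. "###" -> Null
```

The pattern is case-sensitive and only looks at td, so header cells written as th or upper-case tags would need their own rule.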