Strategy to extract structured data with XPath
Is there a pattern to extract structured data from an HTML page using XPath? I'm trying to extract data from one or more HTML tables on a page. XPath makes it easy to find the table(s), but I'm struggling once I've got that far.
I'm currently doing the following:
- Iterate the tables (there may be more than one)
- Iterate the rows within that table
- Iterate the cells within that row
- (Then probably put them in an array and parse the contents)
My code is something like this:
var tables = mydoc.evaluate("//table", mydoc, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null);
var table = tables.iterateNext();
while (table)
{
    // Paths are relative to the current table, then to the current row
    var rows = mydoc.evaluate("tbody/tr", table, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null);
    var row = rows.iterateNext();
    while (row)
    {
        var tds = mydoc.evaluate("td", row, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null);
        var td = tds.iterateNext();
        while (td)
        {
            // TODO: store content in an array to process later
            print('*' + td.textContent);
            td = tds.iterateNext();
        }
        row = rows.iterateNext();
    }
    table = tables.iterateNext();
}
This seems a little nasty as all the XPath examples seem to do their processing in one step. There appear to be few non-trivial examples where two types of data (e.g. labels and values in a table) are selected and combined. I can use the following selectors, but I end up with two lists with no structure:
//table/tbody/tr/td[@class='label']
//table/tbody/tr/td/a[@class='value']
(I know I'm using XPath for HTML parsing for which it wasn't really intended, but it seems to work so far.)
Use:
//table/tbody/tr/td[@class='label']
|
//table/tbody/tr/td/a[@class='value']
This single XPath expression selects all the wanted nodes (all XPath engines I am aware of return the selected nodes in document order). The | (union) operator produces the set union of its arguments.
If the (X)HTML document has a regular structure, you can expect every selected td element (label) in the result to be followed by its corresponding a element (value).
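As a rough sketch of how the flat, document-ordered result could be turned back into structured records: the helper name pairLabelsAndValues is hypothetical, and it assumes each label td is followed by at most one matching a before the next label. It only reads nodeName and textContent, so it accepts any array of nodes.

```javascript
// Hypothetical helper: groups a flat, document-ordered list of nodes
// (label <td> elements and value <a> elements) into {label, value} records.
function pairLabelsAndValues(nodes) {
  const records = [];
  let current = null;
  for (const node of nodes) {
    const name = node.nodeName.toLowerCase();
    if (name === "td") {
      // Every label cell starts a new record.
      current = { label: node.textContent.trim(), value: null };
      records.push(current);
    } else if (name === "a" && current !== null && current.value === null) {
      // The first link after a label is taken as that label's value.
      current.value = node.textContent.trim();
    }
  }
  return records;
}
```

In a browser, the nodes could be collected by evaluating the union expression with XPathResult.ORDERED_NODE_SNAPSHOT_TYPE and copying snapshotItem(0) through snapshotItem(snapshotLength - 1) into an array. A label with no link before the next label ends up with a value of null, which makes irregular rows easy to spot.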
If it's on the main HTML page, you could just do:
for (var tables = document.getElementsByTagName("table"), i = 0; i < tables.length; ++i)
    for (var rows = tables[i].getElementsByTagName("tr"), j = 0; j < rows.length; ++j)
        for (var cells = rows[j].getElementsByTagName("td"), k = 0; k < cells.length; ++k)
            print("*" + cells[k].textContent); // index the cell with k, not i
getElementsByTagName does /not/ return an array; it returns a live collection (a NodeList or an HTMLCollection, depending on the DOM implementation), similar to an ORDERED_NODE_ITERATOR_TYPE result.
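If array methods are needed, one option is to copy the collection into a real array first. This is a minimal sketch, assuming a runtime that provides Array.from (ES2015 or later); the helper name cellTexts is hypothetical.

```javascript
// Hypothetical helper: copies any array-like collection of cells into a
// real array holding each cell's trimmed text.
function cellTexts(cells) {
  return Array.from(cells, function (cell) {
    return cell.textContent.trim();
  });
}
```

For example, cellTexts(rows[j].getElementsByTagName("td")) would give a plain array for one row, which no longer changes if the document is modified afterwards.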