Strategy to extract structured data with XPath

2023-02-20 04:49

Is there a pattern to extract structured data from an HTML page using XPath? I'm trying to extract data from one or more HTML tables on a page. XPath makes it easy to find the table(s), but I'm struggling once I've got that far.

I'm currently doing the following:

Iterate the tables (there may be more than one)
Iterate the rows within that table
Iterate the cells within that row
(Then probably put them in an array and parse the contents)

My code is something like this:

var tables = mydoc.evaluate("//table", mydoc, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null);

var table = tables.iterateNext();
while (table)
{
  // Relative expressions are evaluated against the context node passed as the second argument.
  var rows = mydoc.evaluate("tbody/tr", table, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null);
  var row = rows.iterateNext();
  while (row)
  {
    var tds = mydoc.evaluate("td", row, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null);
    var td = tds.iterateNext();
    while (td)
    {
      // TODO: store content in an array to process later
      print('*' + td.textContent);
      td = tds.iterateNext();
    }
    row = rows.iterateNext();
  }

  // Advance the outer iterator (the variable is `tables`, not `iterator`).
  table = tables.iterateNext();
}

This seems a little nasty as all the XPath examples seem to do their processing in one step. There appear to be few non-trivial examples where two types of data (e.g. labels and values in a table) are selected and combined. I can use the following selectors, but I end up with two lists with no structure:

//table/tbody/tr/td[@class='label']
//table/tbody/tr/td/a[@class='value']

(I know I'm using XPath for HTML parsing for which it wasn't really intended, but it seems to work so far.)
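One way to avoid ending up with two unrelated lists is to select the rows once and then evaluate relative expressions with each row as the context node. The following is only a sketch: the function name `extractPairs` and the `{label, value}` record shape are hypothetical, and it assumes each row holds at most one `td` with class `label` and one `a` with class `value`, as in the selectors above.

```javascript
// Hypothetical helper: returns [{label, value}, ...], one entry per table row.
// Assumes `doc` is a DOM Document that supports evaluate() (e.g. in a browser).
function extractPairs(doc) {
  // Numeric values of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE and
  // XPathResult.STRING_TYPE as defined by DOM Level 3 XPath.
  var ORDERED_NODE_SNAPSHOT_TYPE = 7;
  var STRING_TYPE = 2;

  var rows = doc.evaluate("//table/tbody/tr", doc, null,
                          ORDERED_NODE_SNAPSHOT_TYPE, null);
  var pairs = [];
  for (var i = 0; i < rows.snapshotLength; i++) {
    var row = rows.snapshotItem(i);
    // string(...) yields the text of the first matching node, or "" if none.
    var label = doc.evaluate("string(td[@class='label'])", row, null,
                             STRING_TYPE, null).stringValue;
    var value = doc.evaluate("string(td/a[@class='value'])", row, null,
                             STRING_TYPE, null).stringValue;
    pairs.push({ label: label.trim(), value: value.trim() });
  }
  return pairs;
}
```

Because both expressions are evaluated against the same row, a label and its value stay together even when some rows lack one of them.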

There appear to be few non-trivial examples where two types of data (e.g. labels and values in a table) are selected and combined. I can use the following selectors, but I end up with two lists with no structure:
//table/tbody/tr/td[@class='label'] 
//table/tbody/tr/td/a[@class='value']

Use:

    //table/tbody/tr/td[@class='label']
|
    //table/tbody/tr/td/a[@class='value']

This single XPath expression selects all the wanted nodes (all XPath engines I am aware of return the selected nodes in document order). The | (union) operator produces the set union of its arguments.

If the (X)HTML document has a regular structure, you can expect every selected td element (label) in the returned result to be followed by its corresponding a element (value).
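A minimal sketch of regrouping that flat, document-order result into records. The function name `pairLabelsWithValues` and the `{label, value}` record shape are hypothetical; the only assumption about the nodes is that each exposes `localName` and `textContent`, as DOM elements do. A label with no following link gets `null` as its value, so one irregular row does not shift the later pairs.

```javascript
// Hypothetical helper: `nodes` is the union result in document order,
// e.g. an array filled from an ORDERED_NODE_SNAPSHOT_TYPE result.
function pairLabelsWithValues(nodes) {
  var records = [];
  var current = null;
  for (var i = 0; i < nodes.length; i++) {
    var node = nodes[i];
    var text = node.textContent.trim();
    if (node.localName === "td") {
      // Each label cell starts a new record.
      current = { label: text, value: null };
      records.push(current);
    } else if (node.localName === "a" && current !== null && current.value === null) {
      // The first link after a label is taken as that label's value.
      current.value = text;
    }
  }
  return records;
}
```

The input can be built by looping from 0 to `snapshotLength` over a snapshot result and pushing each `snapshotItem(i)` into an array.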

If it's on the main HTML page, you could just do:

for(var tables=document.getElementsByTagName("table"),i=0;i<tables.length;++i)
  for(var rows=tables[i].getElementsByTagName("tr"),j=0;j<rows.length;++j)
    for(var cells=rows[j].getElementsByTagName("td"),k=0;k<cells.length;++k)
      print("*"+cells[k].textContent);

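getElementsByTagName does /not/ return an array. It returns a live collection (an HTMLCollection in current DOM specifications, a NodeList in older ones) which, like an ORDERED_NODE_ITERATOR_TYPE result, stays tied to the document rather than being a static copy.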
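Since the collection is live, it can help to copy it into real arrays before processing. A sketch, assuming an environment that provides `Array.from` (ES2015 or later); the function name `tableToArrays` is hypothetical:

```javascript
// Hypothetical helper: converts one table element into an array of rows,
// each row being an array of trimmed cell texts.
function tableToArrays(table) {
  // Array.from copies the live collection into a static array
  // and applies the mapping function to each element.
  return Array.from(table.getElementsByTagName("tr"), function (row) {
    return Array.from(row.getElementsByTagName("td"), function (cell) {
      return cell.textContent.trim();
    });
  });
}
```

Rows that contain only th cells come back as empty arrays, and rows of nested tables are included as well, just as in the loop above.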
