
Top-down or bottom-up approach to searching for elements in an HTML DOM document?

Assume I am using a recursive loop to discover and locate DOM elements resiliently, in a way that works across the semi-structured, semi-uniform HTML DOM documents of a website.

For example, when crawling links on a website, I come across small variations in a link's XPath location. Resilience is desired so that crawling can continue uninterrupted.

1) I know that I want a link located in a certain region of the page that is distinguishable from the rest (e.g. menus, footer, header, etc.).

2) It is distinguishable because it appears inside a table and a paragraph or container.

3) An acceptable number of unexpected parents or children may appear before the desired link mentioned in 1), but I don't know in advance what they will be. More unexpected elements would mean a departure from 1).

4) Identifying the element via its id, class, or any other unique attribute value is not desired.

I think the following XPath sums it up:

`//p/table/tr/td/a`

On some pages there are variations in the XPath, but the result still qualifies as the desired link from 1):

`//p/div/table/tr/td/a` or `//p/div/span/span/table/tr/td/b/a`

I have used indentation to mimic each loop iteration.

(Should I use plural or singular: children vs. child, parents vs. parent? I think singular makes sense, as only the immediate parent or child is of concern here.)

TOP DOWN SEARCHING:

How many p's are there?
 How many of these p's have a table as a child? If none, search the next sub-level.
  How many of these tables have a tr as a child? If none, search the next sub-level.
   How many of these tr's have a td as a child? If none, search the next sub-level.
    How many of these td's have an a as a child?

BOTTOM UP SEARCHING:

How many a's are there?
 How many of these a's have a td as parent? If none, look up to the next super-level.
  How many of these td's have a tr as parent? If none, look up to the next super-level.
   How many of these tr's have a table as parent? If none, look up to the next super-level.
    How many of these tables have a p as parent? If none, look up to the next super-level.

Does it matter whether it's top-down or bottom-up? I feel that top-down is wasteful and inefficient if it turns out, by the end of the loop, that the desired anchor link is not found.

I think I would also count how many unexpected parents or children were discovered across the iterations of the loop and compare that count to a preset constant that I am comfortable with, say no more than 2. If there are 3 or more unexpected parents or children before my desired anchor link is discovered, I would assume it's not what I am looking for.
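The bottom-up walk with a tolerance threshold described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `find_links`, the parameters `expected_up` and `max_unexpected`, and the sample markup in `DOC` are all hypothetical, and the sample is parsed as well-formed XML with the standard library, so real-world HTML would need an actual HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, written as well-formed XML so the stdlib can parse it.
DOC = """<html><body>
<p><table><tr><td><a href="/exact">exact</a></td></tr></table></p>
<p><div><table><tr><td><a href="/one-extra">one</a></td></tr></table></div></p>
<p><div><span><span><table><tr><td><b><a href="/four-extra">four</a></b></td></tr></table></span></span></div></p>
<div><a href="/menu">menu</a></div>
</body></html>"""


def find_links(root, expected_up=("td", "tr", "table", "p"), max_unexpected=2):
    """Return <a> elements whose ancestors contain expected_up, in order,
    with at most max_unexpected other elements in between."""
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    matches = []
    for a in root.iter("a"):
        remaining = list(expected_up)   # tags still to be found, nearest first
        unexpected = 0                  # ancestors that were not the expected tag
        node = parent_of.get(a)
        while node is not None and remaining and unexpected <= max_unexpected:
            if node.tag == remaining[0]:
                remaining.pop(0)        # expected ancestor found, move up the chain
            else:
                unexpected += 1         # tolerate it, but count it
            node = parent_of.get(node)
        if not remaining and unexpected <= max_unexpected:
            matches.append(a)
    return matches


if __name__ == "__main__":
    root = ET.fromstring(DOC)
    print([a.get("href") for a in find_links(root)])
```

Starting from each `a` keeps the work proportional to the number of anchors and their depth, and a candidate is dropped as soon as the count of unexpected ancestors passes the threshold. Raising `max_unexpected` admits the more deeply nested variant as well.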

Is this the correct approach? This is just something that I came up with off the top of my head. I apologize if the problem is not clear; I have tried my best. I would love to get some input on this algorithm.


It seems that you want something like:

//p//table//a

If you need to limit the number of intermediate elements in the path, say to not more than 2, then the above would be modified to:

//p[not(ancestor::*[3])]
      //table[ancestor::*[1][self::p] or ancestor::*[2][self::p]]
               /tr/td//a[ancestor::*[1][self::td] or ancestor::*[2][self::td]]

This selects all `a` elements whose parent or grandparent is a `td`, whose parent is a `tr`, whose parent is a `table`, whose parent or grandparent is a `p` that has fewer than 3 ancestor elements.
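As a quick way to try the unrestricted form `//p//table//a`, here is a minimal sketch using Python's standard library. The sample markup and the helper name `loose_links` are made up for illustration. Note that `xml.etree.ElementTree` implements only a subset of XPath and has no `ancestor` axis, so the bounded expression above would need a fuller XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, written as well-formed XML so the stdlib can parse it.
SAMPLE = """<html><body>
<div><a href="/menu">menu</a></div>
<p><table><tr><td><a href="/exact">exact</a></td></tr></table></p>
<p><div><span><span><table><tr><td><b><a href="/deep">deep</a></b></td></tr></table></span></span></div></p>
</body></html>"""


def loose_links(root):
    """Return hrefs of every <a> that is under a <table> that is under a <p>,
    at any depth (the ElementTree spelling of //p//table//a)."""
    return [a.get("href") for a in root.findall(".//p//table//a")]


if __name__ == "__main__":
    print(loose_links(ET.fromstring(SAMPLE)))
```

The menu link is skipped because it has no `p` or `table` ancestor, while both the exact and the deeply nested variants are selected, which shows why the unrestricted form needs the extra ancestor conditions when the depth matters.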
