Top-down or bottom-up approach to searching for elements in an HTML DOM document?

2023-02-02 01:11 Q&A author:

Assume I am using a recursive loop to discover and locate DOM elements resiliently, in a way that works across the semi-structured, semi-uniform HTML DOM documents of a website.

For example, when crawling links on a website, I come across small variations in a link's XPath location. Resilience is desired so that crawling can continue flexibly and without interruption.

1) I know that I want a link located in a certain region of the page that is distinguishable from the rest (e.g. menus, footer, header, etc.).

2) It is distinguishable because it appears to be inside a table and a paragraph or container.

3) There can be an acceptable number of unexpected parents or children before the desired link mentioned in 1), but I don't know what that number is. More unexpected elements would mean a departure from 1).

4) Identifying via element's id and class or any other unique attribute value is not desired.

I think the following XPath sums it up:

`//p/table/tr/td/a`

On some pages there are variations of the XPath, but the link still qualifies as the desired link from 1):

`//p/div/table/tr/td/a` or `//p/div/span/span/table/tr/td/b/a`

I have used indentation to mimic each loop iteration.

(Should I use plural or singular: children vs. child, parents vs. parent? I think singular makes sense, since only the immediate parent or child is of concern here.)

TOP DOWN SEARCHING:

How many p's are there?
 How many of these p's have a table as a child? If none, search the next sublevel.
   How many of these tables have a tr as a child? If none, search the next sublevel.
     How many of these tr's have a td as a child? If none, search the next sublevel.
      How many of these td's have an a as a child?

BOTTOM UP SEARCHING:

How many a's are there?
 How many of these a's have a td as parent? If none, look up to the next superlevel.
  How many of these td's have a tr as parent? If none, look up to the next superlevel.
   How many of these tr's have a table as parent? If none, look up to the next superlevel.
    How many of these tables have a p as parent? If none, look up to the next superlevel.

Does it matter whether it's top down or bottom up? I feel that top down is useless and inefficient if it turns out, by the end of the loop, that the desired anchor link is not found.

I think I would also count how many unexpected parents or children were discovered across the iterations of the loop and compare that count to a preset constant that I am comfortable with, say no more than 2. If there are 3 or more unexpected parents or children before the discovery of my desired anchor link, I would assume it's not what I am looking for.

Is this the correct approach? This is just something that I came up with off the top of my head. I apologize if the problem is not clear; I have tried my best. I would love to get some input on this algorithm.
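The bottom-up idea with a tolerance budget can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes the page is well-formed enough for Python's `xml.etree.ElementTree` (real HTML usually is not, so a forgiving HTML parser would be needed in practice), and the names `SAMPLE`, `find_links`, `expected` and `max_unexpected` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, written as well-formed XML for the sake of the sketch.
SAMPLE = """
<html><body>
  <div><a href="/menu">menu</a></div>
  <p><table><tr><td><a href="/exact">exact</a></td></tr></table></p>
  <p><div><table><tr><td><a href="/near">near</a></td></tr></table></div></p>
  <p><div><span><span><table><tr><td><b><a href="/far">far</a></b></td></tr></table></span></span></div></p>
</body></html>
"""

def find_links(root, expected=("td", "tr", "table", "p"), max_unexpected=2):
    """Bottom-up search: climb from each <a>, matching the expected ancestor
    tags in order and counting every other ancestor as unexpected."""
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    matches = []
    for link in root.iter("a"):
        remaining = list(expected)
        unexpected = 0
        node = parent_of.get(link)
        # Stop when the chain is complete, the budget is exceeded, or the root is passed.
        while node is not None and remaining and unexpected <= max_unexpected:
            if node.tag == remaining[0]:
                remaining.pop(0)
            else:
                unexpected += 1
            node = parent_of.get(node)
        if not remaining and unexpected <= max_unexpected:
            matches.append(link.get("href"))
    return matches

root = ET.fromstring(SAMPLE)
print(find_links(root))                    # ['/exact', '/near']
print(find_links(root, max_unexpected=4))  # ['/exact', '/near', '/far']
```

Starting from the anchors means a link outside the target region (the menu link here) is rejected after at most `max_unexpected + 1` steps upward, without scanning the rest of the tree.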

It seems that you want something like:

//p//table//a
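As a quick check of how the descendant axis (`//`) tolerates intermediate elements, here is a small sketch using Python's standard-library `xml.etree.ElementTree`, which supports a subset of XPath that includes `//`. The sample markup and the names `PAGE`, `page` and `hrefs` are assumptions made up for illustration, and the input is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, written as well-formed XML for the sake of the sketch.
PAGE = """
<html><body>
  <div><a href="/menu">menu</a></div>
  <p><table><tr><td><a href="/exact">exact</a></td></tr></table></p>
  <p><div><table><tr><td><a href="/near">near</a></td></tr></table></div></p>
  <p><div><span><span><table><tr><td><b><a href="/far">far</a></b></td></tr></table></span></span></div></p>
</body></html>
"""

page = ET.fromstring(PAGE)

# ElementTree paths are relative to an element, hence the leading "." here.
hrefs = [a.get("href") for a in page.findall(".//p//table//a")]
print(hrefs)  # ['/exact', '/near', '/far']
```

The menu link is skipped because it has no `p` or `table` ancestor, while every link inside a table inside a paragraph matches regardless of how many elements sit in between.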

If you have a limit on the number of intermediate elements in the path, say not more than 2, then the above would be modified to:

//p[not(ancestor::*[3])]
      //table[ancestor::*[1][self::p] or ancestor::*[2][self::p]]
               /tr/td//a[ancestor::*[1][self::td] or ancestor::*[2][self::td]]

This selects all `a` elements whose parent or grandparent is a `td`, whose parent is a `tr`, whose parent is a `table`, whose parent or grandparent is a `p` that has fewer than 3 ancestor elements.
