get XPATH for all the nodes

2023-02-24 10:45

Is there a library that can give me the XPath for all the nodes in an HTML page?


Yes, if this HTML page is a well-formed XML document.

Depending on what you understand by "node"...

//*

selects all the elements in the document.

/descendant-or-self::node()

selects all elements, text nodes, processing instructions, comment nodes, and the root node /.

//text()

selects all text nodes in the document.

//comment()

selects all comment nodes in the document.

//processing-instruction()

selects all processing instructions in the document.

//@*

selects all attribute nodes in the document.

//namespace::*

selects all namespace nodes in the document.

Finally, you can combine any of the above expressions using the union (|) operator.

Thus, I believe that the following expression really selects "all the nodes" of any XML document:

/descendant-or-self::node() | //@* | //namespace::*
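As a rough illustration of how these expressions behave in practice, here is a minimal sketch using Python's lxml (the library the next answer uses). The sample document and the variable names are assumptions made up for this example; they do not come from the answer above.

```python
from lxml import etree

# Hypothetical sample document: two elements, two attributes,
# one comment, one processing instruction and one text node.
doc = etree.fromstring(
    '<root a="1"><!-- note --><child b="2">text</child><?pi data?></root>'
)

elements = doc.xpath('//*')                      # element nodes
texts = doc.xpath('//text()')                    # text nodes (returned as strings)
comments = doc.xpath('//comment()')              # comment nodes
pis = doc.xpath('//processing-instruction()')    # processing instructions
attrs = doc.xpath('//@*')                        # attribute values (returned as strings)

print([el.tag for el in elements])
print(texts)
print(len(comments), len(pis), list(attrs))
```

Note that lxml hands back text and attribute nodes as string-like objects rather than as elements, so the result types differ from one expression to the next.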

In case this is helpful for someone else: if you're using Python with lxml, you first need a tree, and then you query that tree with the XPath expressions that Dimitre lists above.

To get the tree:

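import lxml.html
from lxml import etree

# Deliberately malformed HTML: unclosed tags and a mismatched </h3>.
your_webpage_string = "<html><head><title>test<body><h1>page title</h3>"
# lxml's lenient HTML parser repairs the markup.
bad_html = lxml.html.fromstring(your_webpage_string)
# Serialize the repaired tree as XML, then re-parse it with the XML parser.
good_html = etree.tostring(bad_html, pretty_print=True).strip()
your_tree = etree.fromstring(good_html)
all_xpaths = your_tree.xpath('//*')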

On the last line, replace '//*' with whatever XPath expression you want. all_xpaths is now a list of the matching Element objects (not path strings), which looks like this:

[<Element html at 0x7ff740b24b90>,
 <Element head at 0x7ff740b24d88>,
 <Element title at 0x7ff740b24dd0>,
 <Element body at 0x7ff740b24e18>,
 <Element h1 at 0x7ff740b24e60>]
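The list above holds Element objects rather than XPath strings. If what you are after is a path string for each element, one option is lxml's ElementTree.getpath(). The following is only a sketch: the sample markup and the names root, tree and element_paths are assumptions for illustration, and the input is already well-formed so the repair step can be skipped.

```python
from lxml import etree

# Hypothetical, already well-formed sample input.
sample = (
    "<html><head><title>test</title></head>"
    "<body><h1>page title</h1></body></html>"
)
root = etree.fromstring(sample)

# getpath() is a method of the ElementTree, not of the Element,
# so fetch the tree that owns the root element first.
tree = root.getroottree()

# Build one path string per element selected by '//*'.
element_paths = [tree.getpath(el) for el in root.xpath('//*')]

for path in element_paths:
    print(path)
```

Each returned path can be passed back to tree.xpath() to locate the same element again.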
