locate element using lxml.html vs BeautifulSoup

2023-02-28 01:36

I'm scraping an HTML document with lxml.html. There is one thing I can do in BeautifulSoup but can't manage to do with lxml.html. Here it is:

from BeautifulSoup import BeautifulSoup
import re

doc = ['<html>',
'<h2> some text </h2>',
'<p> some more text </p>',
'<table> <tr> <td> A table</td> </tr> </table>',
'<h2> some special text </h2>',
'<p> some more text </p>',
'<table> <tr> <td> The table I want </td> </tr> </table>',
'</html>']
soup = BeautifulSoup(''.join(doc))
print soup.find(text=re.compile("special")).findNext('table')

I tried this with cssselect, without success. Any ideas on how I could locate this table using the methods in lxml.html?

Many thanks, D

You can use a regular expression in an lxml XPath expression through the EXSLT regular-expressions namespace. For example, given your document, the paths below find the node whose text matches the regexp spe.*al and then select the siblings that follow it:

import lxml.html

# DOC is the question's markup joined into a single string
DOC = ''.join(doc)

# namespace URI that enables re:test() inside XPath
NS = 'http://exslt.org/regular-expressions'
tree = lxml.html.fromstring(DOC)

# select sibling table nodes after matching node
path = "//*[re:test(text(), 'spe.*al')]/following-sibling::table"
print(tree.xpath(path, namespaces={'re': NS}))

# select all sibling nodes after matching node
path = "//*[re:test(text(), 'spe.*al')]/following-sibling::*"
print(tree.xpath(path, namespaces={'re': NS}))

Output:

[<Element table at 7fe21acd3f58>]
[<Element p at 7f76ac2c3f58>, <Element table at 7f76ac2e6050>]
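
One difference worth noting: findNext('table') in BeautifulSoup returns the next table anywhere later in the document, while following-sibling:: only considers siblings of the matched node. If a plain substring match is enough, the following axis with a [1] predicate is closer in spirit. Below is a minimal sketch under those assumptions (Python 3, substring match instead of the regexp, and DOC being the question's markup joined into one string); it is one way to do it, not the only one.

```python
import lxml.html

# Same markup as in the question, joined into one string.
DOC = (
    "<html>"
    "<h2> some text </h2>"
    "<p> some more text </p>"
    "<table> <tr> <td> A table</td> </tr> </table>"
    "<h2> some special text </h2>"
    "<p> some more text </p>"
    "<table> <tr> <td> The table I want </td> </tr> </table>"
    "</html>"
)

tree = lxml.html.fromstring(DOC)

# Any element that has a text node containing "special", then the first
# <table> that follows it in document order (sibling or not).
path = "//*[text()[contains(., 'special')]]/following::table[1]"
tables = tree.xpath(path)

for table in tables:
    # text_content() gathers the text of the table and its descendants
    print(table.text_content().strip())
```

Because contains() is part of XPath 1.0, this variant needs no namespaces argument. Swap the predicate for the re:test() form shown earlier if a real regular expression is required.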
