Once I have identified the beginning and end parts of a section of an html document using lxml, how do I get everything between them

2023-01-11 16:16 Q&A author:

I am working with some HTML files and trying to find a consistent way to get at some text in the documents. I know that the section I want begins with some bolded words and ends with other bolded words.

bolded_items=atree.cssselect('b')

myKeys=[item for item in bolded_items if item.text if 'KEY' in item.text]

So myKeys is a list whose members are elements from atree, specifically elements that have bolded text and have the word 'KEY' in the text.

I now want to identify all of the parts of the tree between any two elements in myKeys so that I can manipulate them in various ways. I have been playing around with getparent, getchildren, getnext and all of the other methods that looked likely after running dir(myKeys[0]), but I am not making progress.

Any suggestions would be appreciated.

I'd suggest using SAX for this task.

Basic docs are available at http://lxml.de/sax.html#producing-sax-events-from-an-elementtree-or-element

Your handler should consume events without taking any action until it receives the starting bolded item, and then write events into a new buffer, tree or whatever until it receives the terminating bolded item.
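A rough sketch of what such a handler might look like, assuming the markers are <b> elements whose text contains 'KEY' as in the question. The class name BetweenKeysHandler and the sample markup are made up for illustration. lxml.sax.saxify sends namespace-aware events, so the handler overrides startElementNS and endElementNS rather than startElement and endElement. This version only collects text, not a new tree.

```python
from xml.sax.handler import ContentHandler

from lxml import html
from lxml.sax import saxify


class BetweenKeysHandler(ContentHandler):
    """Collect the text between the first two <b> elements containing a marker.

    Hypothetical helper for illustration, not part of lxml.
    """

    def __init__(self, marker="KEY"):
        super().__init__()
        self.marker = marker
        self.keys_seen = 0    # number of marker <b> elements closed so far
        self.bold_depth = 0   # greater than 0 while inside a <b> element
        self.bold_text = []   # text of the <b> currently being read
        self.captured = []    # text found between the two markers

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace_uri, local_name) tuple
        if name[1] == "b":
            self.bold_depth += 1

    def characters(self, content):
        if self.bold_depth:
            # Inside <b>: hold the text until </b>, when we know if it is a marker
            self.bold_text.append(content)
        elif self.keys_seen == 1:
            self.captured.append(content)

    def endElementNS(self, name, qname):
        if name[1] != "b":
            return
        self.bold_depth -= 1
        if self.bold_depth:
            return
        text = "".join(self.bold_text)
        self.bold_text = []
        if self.marker in text:
            self.keys_seen += 1
        elif self.keys_seen == 1:
            # Ordinary bold text inside the section is kept
            self.captured.append(text)


# Made-up sample document
sample = (
    "<html><body>"
    "<p>intro</p>"
    "<p><b>KEY ONE</b></p>"
    "<p>first <b>bold</b> paragraph</p>"
    "<p>second paragraph</p>"
    "<p><b>KEY TWO</b></p>"
    "<p>outro</p>"
    "</body></html>"
)

handler = BetweenKeysHandler()
saxify(html.fromstring(sample), handler)
section = " ".join(part.strip() for part in handler.captured if part.strip())
print(section)
```

The handler never looks at the tree itself, only at the stream of events, so the same class should work for any pair of markers by changing the counting logic in endElementNS.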

In the spirit of SO I have figured out what I think is the best answer and am going to post it myself.

from lxml import html

# Parse the HTML file into an element tree
with open(r'c:\temp\testlxml.htm') as f:
    aTree=html.fromstring(f.read())

# All bolded elements, then the ones whose text contains 'KEY'
bolds=aTree.cssselect('b')
theTitles=[item.text for item in bolds if item.text if 'KEY' in item.text]
theBoldItems=[item for item in bolds if item.text if 'KEY' in item.text]

# Every element in the tree, in document order
theFullList=list(aTree.iter())

# Positions of the first two key elements in that list
first=theFullList.index(theBoldItems[0])
second=theFullList.index(theBoldItems[1])

# Collect the text and tail of everything from the first key up to the second
theText=[]
for item in theFullList[first:second]:
    if item.text:
        theText.append(item.text)
    if item.tail:
        theText.append(item.tail)

aString=' '.join(theText)

A little bit of explanation.

My goal is to apply some logic to the bolded parts of the documents, as those bolded sections that have the word KEY in them define different sections of the document. theTitles is a list of the text of the bolded elements that have the word 'KEY' included. Based on my particular needs I might want all of the text between any two items from theTitles, so I can create tests and the necessary logic to select items from theTitles.

theBoldItems is a list of the actual elements; for any i, theTitles[i]==theBoldItems[i].text

Next I get theFullList, which is all of the HTML elements in the tree. Because lxml iterates over the tree in document order, I know that I want to capture all of the elements between theBoldItems[i] and theBoldItems[i+1]. And the nice thing is that, the way Python is built, the test is that easy.

I can now get the text for all of those things and while I still need to clean it up some I have successfully ripped out all of the text between any two items I might want.
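One possible way to wrap the same idea in a reusable function is sketched below. It is a variation rather than the code above: it walks the tree with lxml's etree.iterwalk, so that each element's tail is added after its children's text instead of before, and it skips the text of the starting key itself. The function name text_between and the sample markup are made up for illustration, and xpath('//b') is used in place of cssselect('b') only to avoid needing the separate cssselect package.

```python
from lxml import etree, html


def text_between(root, start, end):
    """Return the text strictly between two elements, in document order.

    Hypothetical helper for illustration, not part of lxml.
    """
    parts = []
    capturing = False
    for event, el in etree.iterwalk(root, events=("start", "end")):
        if event == "start":
            if el is end:
                break
            # Skip comments and similar nodes, whose tag is not a string
            if capturing and isinstance(el.tag, str) and el.text:
                parts.append(el.text)
        else:
            # "end" event: the tail is the text right after the closing tag
            if el is start:
                capturing = True
            if capturing and el.tail:
                parts.append(el.tail)
    # Collapse runs of whitespace into single spaces
    return " ".join(" ".join(parts).split())


# Made-up sample document
sample = (
    "<html><body>"
    "<p>intro</p>"
    "<p><b>KEY ONE</b></p>"
    "<p>first <b>bold</b> paragraph</p>"
    "<p>second paragraph</p>"
    "<p><b>KEY TWO</b></p>"
    "<p>outro</p>"
    "</body></html>"
)

root = html.fromstring(sample)
keys = [b for b in root.xpath("//b") if b.text and "KEY" in b.text]
result = text_between(root, keys[0], keys[1])
print(result)
```

To get the text of section i, the call would be text_between(root, keys[i], keys[i+1]).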
