Can I look at the actual line that was the source of an element parsed from an HTML document using lxml?
I have been having fun manipulating HTML with lxml. Now I want to manipulate the actual file: after finding a particular element that meets my needs, I want to know whether it is possible to retrieve the original source of that element.
I jumped up and down in my chair after seeing sourceline as an attribute of my element, but it did not give me what I wanted.
some_element.sourceline
As near as I can figure, sourceline only works when the HTML source is read in as a list of lines, so that you get a line number.
I had better add that I generated my elements like this:
from lxml import html
theTree = html.fromstring(open(myFileRef).read())
the_elements = list(theTree.iter())
To be clear, I am getting None as the value for some_element.sourceline. I tested this for all 27,000 elements in my tree.
One thing I imagine doing is using the element's HTML source in a search expression to find that exact place in the document, perhaps to snip something out. I can't rely on the text of an element because the text is not necessarily unique.
One solution that was posted but then taken down was to use sourceline, but even after reading in my file as a list I was not able to get any value other than None for sourceline. I am going to post another question to see if someone has an example using sourceline.
I just tried and discarded html.tostring(myelement), because it automatically converts at least some character encodings (I am probably not phrasing that correctly). Here is an example:
A snippet of the HTML source:
<b> KEY 1A. REGIONAL PRODUCTION <br> </b>
html.tostring(the_element,method='html')
'<b> KEY 1A.    REGIONAL PRODUCTION <br></b>'
Clearly I am not getting the original, unvarnished source.
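Since html.tostring re-serializes the parsed tree rather than echoing the file, one way to get at the unvarnished text of a tag is to let a tokenizer report where each start tag begins. This is only a rough sketch, and it swaps in the standard library's html.parser in place of lxml, which is an assumption and not something used in the question. TagLocator and sample are made-up names for illustration.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Hypothetical helper: record where each start tag sits in the raw text."""

    def __init__(self):
        super().__init__()
        self.found = []  # (tag, line, column, raw start-tag text)

    def handle_starttag(self, tag, attrs):
        # getpos() gives a 1-based line number and a 0-based column offset.
        line, col = self.getpos()
        # get_starttag_text() returns the start tag exactly as written.
        self.found.append((tag, line, col, self.get_starttag_text()))


# Made-up sample document built around the snippet from the question.
sample = (
    "<html>\n"
    "<body>\n"
    "<b> KEY 1A. REGIONAL PRODUCTION <br> </b>\n"
    "</body>\n"
    "</html>"
)

locator = TagLocator()
locator.feed(sample)
locator.close()

for tag, line, col, raw in locator.found:
    print(tag, line, col, repr(raw))
```

With a line and column in hand, the original text can be sliced out of the file contents directly, so nothing is re-encoded along the way. Matching these positions back to particular lxml elements is left open here.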
I think I found the issue, as I was having the same problem.
I believe element.sourceline is lost if you apply any kind of XSLT transform to the document when you parse it.
When I do not transform the document, I get sourceline fine; however, when I use etree.XSLT, I lose all sourceline data.
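Once sourceline does come back as a number, the raw text of that line can be pulled straight from the file contents. Here is a minimal sketch, assuming the number is 1-based and counts "\n" line breaks. The helper name line_at and the sample text are hypothetical, and the line number is hard-coded instead of coming from a real element.

```python
def line_at(source_text, lineno):
    """Return line `lineno` (1-based) of source_text, or None if unavailable."""
    if lineno is None:
        # Mirrors the case where element.sourceline is None.
        return None
    lines = source_text.split("\n")
    if 1 <= lineno <= len(lines):
        # Drop a trailing carriage return left over from Windows line endings.
        return lines[lineno - 1].rstrip("\r")
    return None


# Made-up sample; in practice this would be the text read from the HTML file.
sample = (
    "<html>\n"
    "<body>\n"
    "<b> KEY 1A. REGIONAL PRODUCTION <br> </b>\n"
    "</body>\n"
    "</html>"
)

print(line_at(sample, 3))     # stand-in for line_at(sample, some_element.sourceline)
print(line_at(sample, None))
```

This returns the whole line and not just the element, so if several elements share a line some further searching within that line would still be needed.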