Extract a specific element from nested elements using lxml.html

Hi all, I am running into a problem that I think comes down to my XPath. I am using the html module from the lxml package to get at some data. Below is the most simplified version of the situation; keep in mind that the HTML I am actually working with is much uglier.

<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr>
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>

What I really want is the innermost table, the one that contains the header text "Header1". This is what I am trying:

from lxml import html
page = '...'
tree = html.fromstring(page)
print(tree.xpath('//table[//*[contains(text(), "Header1")]]'))

But that gives me all of the table elements, and I just want the one table that contains this text. I understand what is going on, but I am having a hard time figuring out how to do this short of breaking out some nasty regex. Any thoughts?


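Use:

//td[. = 'Header1']/ancestor::table[1]

Comparing . (the cell's string value) rather than text() matters here, because the header text sits inside nested u and b elements and is not a direct text child of the td.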
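
One thing worth checking on markup like the question's, where the header text is wrapped in u and b: text() only looks at a cell's direct text nodes, while . compares its whole string value. A minimal sketch (the page string is a trimmed copy of the question's sample, and the variable names are my own):

```python
from lxml import html

# Trimmed copy of the question's markup; the header text is wrapped in <u><b>.
page = """
<table>
  <tr><td>
    <table>
      <tr><td><u><b>Header1</b></u></td></tr>
      <tr><td>Data</td></tr>
    </table>
  </td></tr>
</table>
"""

tree = html.fromstring(page)

# text() only sees text nodes that are direct children of the td,
# and the header cell has none (its only child is <u>).
by_text_node = tree.xpath("//td[text() = 'Header1']")

# "." compares the td's full string value, including text in nested elements.
by_string_value = tree.xpath("//td[. = 'Header1']")

# Nearest table ancestor of the matching cell.
tables = tree.xpath("//td[. = 'Header1']/ancestor::table[1]")

print(len(by_text_node), len(by_string_value), len(tables))
```

If your real cells hold the text directly (no u or b wrapper), the text() form would match as well.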


Find the header you are interested in and then pull out its table.

//u[b = 'Header1']/ancestor::table[1]

or

//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]

Note that an expression beginning with // always starts at the document root (!), even when it appears inside a predicate. You can't do:

//table[//*[contains(text(), "Header1")]]

and expect the inner predicate (//*…) to magically start at the right context. Use .// to start at the context node. Even then, this:

//table[.//*[contains(text(), "Header1")]]

won't work either, since even the outermost table contains the text 'Header1' somewhere deep down, so the predicate evaluates to true for every table in your example. Use not(.//table), as I did above, to rule out anything that has another table nested inside it.

Also, don't test the condition on every node with .//* when you know which element holds the text. Most nodes can never match, so it's more efficient to name the element (b, u or td) explicitly.
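
To see how these expressions behave on the sample markup, here is a minimal sketch. The page string is copied from the question; the variable names are my own.

```python
from lxml import html

# The three nested tables from the question.
page = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr>
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(page)

# Inner // is evaluated from the document root, so the predicate is the
# same (true) for every table.
absolute = tree.xpath('//table[//*[contains(text(), "Header1")]]')

# .// is relative to each table, but every table has the <b> as a descendant.
relative = tree.xpath('//table[.//*[contains(text(), "Header1")]]')

# Start from the cell that has no nested table, then climb to its table.
innermost = tree.xpath(
    "//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]"
)

print(len(absolute), len(relative), len(innermost))
```

Only the last expression narrows the result down to a single table.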


Perhaps this would work for you:

tree.xpath("//table[not(descendant::table)]/*[contains(., 'Header1')]")

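The not(descendant::table) bit ensures that you're looking at the innermost table. Note that the trailing /*[...] step returns the matching child of that table (the header row here), not the table itself. To get the table, move the test into the predicate: //table[not(descendant::table) and contains(., 'Header1')]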
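
A small sketch of the difference between selecting a child of the innermost table and selecting the table itself. The page string is a trimmed copy of the question's sample, the names are my own, and the second expression is my variation rather than part of the answer above.

```python
from lxml import html

# Trimmed copy of the question's markup.
page = """
<table>
  <tr><td>
    <table>
      <tr><td><u><b>Header1</b></u></td></tr>
      <tr><td>Data</td></tr>
    </table>
  </td></tr>
</table>
"""

tree = html.fromstring(page)

# Selects children of the innermost table whose text contains 'Header1'.
children = tree.xpath(
    "//table[not(descendant::table)]/*[contains(., 'Header1')]"
)

# Variation: keep the test in the predicate so the table itself is returned.
tables = tree.xpath(
    "//table[not(descendant::table) and contains(., 'Header1')]"
)

print([el.tag for el in children])
print([el.tag for el in tables])
```

If you already have the child element, calling getparent() on it gets you to the table as well.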


table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
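  • //*[.="Header1"] selects any element in the document whose string value is exactly Header1 (here the td, u and b elements all qualify).
  • ancestor::table[1] selects the nearest table ancestor of each of those elements. They all share the same one, so the result contains a single table.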

Complete example

#!/usr/bin/env python
from lxml import html

page = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr> 
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(page)
table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
print(html.tostring(table, encoding="unicode"))