
How to define findAll for nested HTML tags using BeautifulSoup

Given

<a href="www.example.com/"></a>

<table class="theclass">
<tr><td>
<a href="www.example.com/two">two</a>
</td></tr>
<tr><td>
<a href ="www.example.com/three">three</a>
<span>blabla<span>
</td></td>
</table>

How can I scrape only the links that are inside <table class="theclass">? I tried using

soup = util.mysoupopen(theexample)
infoText = soup.findAll("table", {"class": "theclass"})

but I did not know how to further refine the search. I also tried turning the result of findAll() into an array and then looking for patterns in where the needle showed up, but I couldn't find a consistent pattern. Thanks.


If I understood your question correctly, the following Python code should work. It iterates over all tables with class="theclass" and then finds the links inside each one.

>>> foo = """<a href="www.example.com/"></a>
... <table class="theclass">
... <tr><td>
... <a href="www.example.com/two">two</a>
... </td></tr>
... <tr><td>
... <a href ="www.example.com/three">three</a>
... <span>blabla<span>
... </td></td>
... </table>
... """
>>> import BeautifulSoup as bs
>>> soup = bs.BeautifulSoup(foo)
>>> links = []
>>> for table in soup.findAll('table', {'class': 'theclass'}):
...     links.extend(table.findAll('a'))  # collect links from every matching table
... 
>>> print links
[<a href="www.example.com/two">two</a>, <a href="www.example.com/three">three</a>]
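For readers on Python 3, here is a minimal sketch of the same idea using the bs4 package (BeautifulSoup 4), where find_all and the class_ keyword replace findAll and the attribute dict. The sample markup is the question's HTML with its closing tags tidied up, and the names html, links, hrefs and css_hrefs are only illustrative. It assumes bs4 is installed and uses the standard library's html.parser backend.

```python
from bs4 import BeautifulSoup

# The question's sample markup, with the closing tags tidied up (an assumption).
html = """<a href="www.example.com/"></a>
<table class="theclass">
<tr><td>
<a href="www.example.com/two">two</a>
</td></tr>
<tr><td>
<a href="www.example.com/three">three</a>
<span>blabla</span>
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the links from every table that carries the class "theclass".
links = []
for table in soup.find_all("table", class_="theclass"):
    links.extend(table.find_all("a"))

hrefs = [a["href"] for a in links]

# A CSS selector expresses the same nesting in a single call.
css_hrefs = [a["href"] for a in soup.select("table.theclass a")]
```

Both approaches skip the first link, because it sits outside the table. The selector form is shorter, while the explicit loop is handy if each table needs separate handling.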


infoText is a list. You should iterate over it.

>>> for info in infoText:
...     print info.tr.td.a
... 
<a href="www.example.com/two">two</a>

Each item in the loop is a <table> element. If you expect only one table with class "theclass" in your document, soup.find("table", {"class": "theclass"}) gives you that table directly.
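As a rough sketch of the single-table case in Python 3 with bs4 (an assumption, since the answer itself uses the older API): find returns the first match or None, so it is worth checking the result before navigating into it. Note also that attribute navigation such as info.tr.td.a only ever yields the first link. The names html, table, first_link and missing are illustrative.

```python
from bs4 import BeautifulSoup

# A one-row version of the question's table, used only for illustration.
html = (
    '<table class="theclass"><tr><td>'
    '<a href="www.example.com/two">two</a>'
    '</td></tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
table = soup.find("table", class_="theclass")
first_link = table.find("a") if table is not None else None

# Searching for a class that does not occur in the document yields None.
missing = soup.find("table", class_="otherclass")
```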
