Get table with maximum number of rows in a page using BeautifulSoup

Can anyone tell me how I can get the table in an HTML page that has the most rows? I'm using BeautifulSoup.

There is one little problem though. Sometimes, there seems to be one table nested inside another.

<table>
    <tr>
        <td>
            <table>
                <tr>
                    <td></td>
                    <td></td>
                    <td></td>
                </tr>
                <tr>
                    <td></td>
                    <td></td>
                    <td></td>
                </tr>
                <tr>
                    <td></td>
                    <td></td>
                    <td></td>
                </tr>
            </table>
        </td>
    </tr>
</table>

When table.findAll('tr') executes, it counts the table's own rows plus the rows of any table nested inside it. Here the parent table has just one row while the nested table has three, so I would consider the nested table to be the largest. Below is the code I'm currently using to find the largest table, but it doesn't take this scenario into account.

soup = BeautifulSoup(html)

#Get the largest table
largest_table = None
max_rows = 0
for table in soup.findAll('table'):
    number_of_rows = len(table.findAll('tr'))
    if number_of_rows > max_rows:
        largest_table = table
        max_rows = number_of_rows

I'm really lost with this. Any help guys?

Thanks in advance


Calculate number_of_rows like this, so that only rows whose nearest enclosing table is the current table are counted:

number_of_rows = len(table.findAll(lambda tag: tag.name == 'tr' and tag.findParent('table') == table))
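For reference, here is a minimal sketch of the whole loop with that filter applied. It assumes the newer bs4 package (find_all / find_parent instead of findAll / findParent) and the built-in "html.parser" backend. The names largest_table and SAMPLE_HTML are made up for illustration. It compares with `is` rather than `==`, since two separate tables with identical markup may compare equal under `==`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: an outer table with one row that wraps
# an inner table with three rows.
SAMPLE_HTML = """
<table id="outer">
  <tr><td>
    <table id="inner">
      <tr><td></td></tr>
      <tr><td></td></tr>
      <tr><td></td></tr>
    </table>
  </td></tr>
</table>
"""


def largest_table(html):
    """Return (table, row_count) for the table with the most direct rows."""
    soup = BeautifulSoup(html, "html.parser")
    best, best_rows = None, 0
    for table in soup.find_all("table"):
        # Keep only rows whose nearest enclosing <table> is this table,
        # so rows of nested tables are not counted for the parent.
        own_rows = [
            tr for tr in table.find_all("tr")
            if tr.find_parent("table") is table
        ]
        if len(own_rows) > best_rows:
            best, best_rows = table, len(own_rows)
    return best, best_rows


table, row_count = largest_table(SAMPLE_HTML)
print(table["id"], row_count)
```

With this sample, the outer table is credited with only its single row, so the inner table is picked. If no table in the page has any rows, the function returns (None, 0).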