How do you get all the rows from a particular table using BeautifulSoup?

2022-12-16 00:06 问答作者：

I am learning Python and BeautifulSoup to scrape data from the web, and read a HTML table. I can read it into 开发者_如何学PythonOpen Office and it says that it is Table #11.

It seems like BeautifulSoup is the preferred choice, but can anyone tell me how to grab a particular table and all the rows? I have looked at the module documentation, but can't get my head around it. Many of the examples that I have found online appear to do more than I need.

This should be pretty straight forward if you have a chunk of HTML to parse with BeautifulSoup. The general idea is to navigate to your table using the findChildren method, then you can get the text value inside the cell with the string property.

>>> from BeautifulSoup import BeautifulSoup
>>> 
>>> html = """
... <html>
... <body>
...     <table>
...         <th><td>column 1</td><td>column 2</td></th>
...         <tr><td>value 1</td><td>value 2</td></tr>
...     </table>
... </body>
... </html>
... """
>>>
>>> soup = BeautifulSoup(html)
>>> tables = soup.findChildren('table')
>>>
>>> # This will get the first (and only) table. Your page may have more.
>>> my_table = tables[0]
>>>
>>> # You can find children with multiple tags by passing a list of strings
>>> rows = my_table.findChildren(['th', 'tr'])
>>>
>>> for row in rows:
...     cells = row.findChildren('td')
...     for cell in cells:
...         value = cell.string
...         print("The value in this cell is %s" % value)
... 
The value in this cell is column 1
The value in this cell is column 2
The value in this cell is value 1
The value in this cell is value 2
>>>

If you ever have nested tables (as on the old-school designed websites), the above approach might fail.

As a solution, you might want to extract non-nested tables first:

html = '''<table>
<tr>
<td>Top level table cell</td>
<td>
    <table>
    <tr><td>Nested table cell</td></tr>
    <tr><td>...another nested cell</td></tr>
    </table>
</td>
</tr>
</table>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
non_nested_tables = [t for t in soup.find_all('table') if not t.find_all('table')]

Alternatively, if you want to extract content of all the tables, including those that nest other tables, you can extract only top-level tr and th/td headers. For this, you need to turn off recursion when calling the find_all method:

soup = BeautifulSoup(html, 'lxml')
tables = soup.find_all('table')
cnt = 0
for my_table in tables:
    cnt += 1
    print ('=============== TABLE {} ==============='.format(cnt))
    rows = my_table.find_all('tr', recursive=False)                  # <-- HERE
    for row in rows:
        cells = row.find_all(['th', 'td'], recursive=False)          # <-- HERE
        for cell in cells:
            # DO SOMETHING
            if cell.string: print (cell.string)

Output:

=============== TABLE 1 ===============
Top level table cell
=============== TABLE 2 ===============
Nested table cell
...another nested cell

The recursive is a great trick if you don't have nested tables, but if you do, then you need to do things one level at a time.

The one HTML variation that could bite you is the following where the tbody and or thead elements are also used.

html = '
    <table class="fancy">
        <thead>
           <tr><th>Nested table cell</th></tr>
        </thead>
        <tbody>
            <tr><td><table id=2>...another nested cell</table></td></tr>
        </tbody> 
        </table>
    </table>

in this situation, you will need to do the following

   table = soup.find_all("table", {"class": "fancy"})[0]
    thead = table.find_all('thead', recursive=False)
    header = thead[0].findChildren('th')
    
    tbody = table.find_all('tbody', recursive=False)
    rows = tbody[0].find_all('tr', recursive=False)

now you have the head and rows

继续阅读：python

How do you get all the rows from a particular table using BeautifulSoup?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？