Beautiful Soup [Python] and the extracting of text in a table

2023-01-29 20:49 Q&A author:

I am new to Python and to Beautiful Soup as well. I have heard that Beautiful Soup is a great tool for parsing and extracting content, so here I am:

I want to take the content of the first td of a table in an HTML document. For example, I have this table:

<table class="bp_ergebnis_tab_info">
    <tr>
            <td>
                     This is a sample text
            </td>

            <td>
                     This is the second sample text
            </td>
    </tr>
</table>

How can I use Beautiful Soup to get the text "This is a sample text"? I use soup.findAll('table', attrs={'class':'bp_ergebnis_tab_info'}) to get the whole table.

Thanks. Or should I try to do the whole thing with Perl, which I am not so familiar with? Another solution would be a regex in PHP.

See the target [1]: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=799.601437941842&SchulAdresseMapDO=142323

Note: since the HTML is somewhat invalid, I think some cleaning will be needed first. That could take a lot of PHP code if we solve the job in PHP. Perl would be a good solution too.

Many thanks for any hints and ideas for a starting point.

First find the table (as you are doing). Using find rather than findAll returns the first match directly instead of a list of all matches (with findAll we would have to add an extra [0] to take the first element of the list):

table = soup.find('table', attrs={'class':'bp_ergebnis_tab_info'})

Then use find again to find the first td:

first_td = table.find('td')

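Then use renderContents() to extract the contents of the cell (note that in Beautiful Soup 4 on Python 3 this method returns bytes; first_td.get_text() returns a string instead):

text = first_td.renderContents()

... and the job is done, though you may also want to use strip() to remove leading and trailing whitespace: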

trimmed_text = text.strip()

This should give:

>>> print trimmed_text
This is a sample text
>>>

as desired.
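
The steps above can be put together in one short script. This is only a sketch, assuming Beautiful Soup 4 (the bs4 package) on Python 3 and the built-in "html.parser" backend; the html string is just the sample table from the question, and get_text(strip=True) is used in place of renderContents() plus strip() so that the result is a string.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (assumption: the real page
# contains a table with this class attribute).
html = """
<table class="bp_ergebnis_tab_info">
    <tr>
        <td>
            This is a sample text
        </td>
        <td>
            This is the second sample text
        </td>
    </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
table = soup.find("table", attrs={"class": "bp_ergebnis_tab_info"})
first_td = table.find("td") if table is not None else None

# get_text(strip=True) returns the cell text with surrounding whitespace removed.
first_text = first_td.get_text(strip=True) if first_td is not None else None
print(first_text)
```

The None checks are there because find() does not raise when the table or cell is missing; it simply returns None.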

Use the .text attribute to get the text inside each td.

1) First read the table from the DOM using its tag or ID

soup = BeautifulSoup(self.driver.page_source, "html.parser")
htnm_migration_table = soup.find("table", {'id':'htnm_migration_table'})

2) Read tbody

tbody = htnm_migration_table.find('tbody')

3) Read all tr from tbody tag

trs = tbody.find_all('tr')

4) Get all tds from each tr

for tr in trs:
    tds = tr.find_all('td')
    for td in tds:
        print(td.text)
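
The four steps can also be collected into one runnable sketch. It assumes Beautiful Soup 4 and uses a small made-up table in place of the Selenium page source; only the id htnm_migration_table comes from the answer, the cell values are placeholders. Since a page may not contain a tbody element at all, the sketch falls back to the table itself in that case.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source; cell values are placeholders.
html = """
<table id="htnm_migration_table">
  <tbody>
    <tr><td>a1</td><td>a2</td></tr>
    <tr><td>b1</td><td>b2</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "htnm_migration_table"})

# Use the tbody if the markup has one, otherwise search the table directly.
tbody = table.find("tbody")
container = tbody if tbody is not None else table

# One list of cell texts per row.
rows = []
for tr in container.find_all("tr"):
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(rows)
```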

I find Beautiful Soup a very effective tool, so keep learning it :-) It can parse pages with invalid markup, so it should be able to handle the page you refer to. If you want a reformatted page source with repaired markup, you can call BeautifulSoup(html).prettify().

As for your question, the result of a soup.find(...) call is itself a Beautiful Soup object, so you can run a second search inside it. Note that soup.findAll(...) returns a list of such objects, so either use find or index the list (for example table_soup[0]) before searching again:

table_soup = soup.find('table', attrs={'class':'bp_ergebnis_tab_info'})
your_sample_text = table_soup.find("td").renderContents().strip()

print(your_sample_text)
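
As a small illustration of parsing incomplete markup, the sketch below feeds Beautiful Soup 4 a fragment whose closing tags for tr and table are missing. The fragment is made up for this example, and the exact repaired output depends on the parser backend, so only the extracted cell text is relied on here.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the </tr> and </table> closing tags are missing.
broken_html = '<table class="bp_ergebnis_tab_info"><tr><td>This is a sample text</td>'

soup = BeautifulSoup(broken_html, "html.parser")

# prettify() returns the parsed tree as an indented string; tags left open
# in the input are closed in the output.
pretty = soup.prettify()
print(pretty)

# The cell is still reachable despite the missing closing tags.
cell_text = soup.find("td").get_text(strip=True)
print(cell_text)
```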
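
A nested search over several tables can be sketched like this, assuming Beautiful Soup 4, where find_all is the current name for findAll. It returns a list-like result, so the second search is run on each element rather than on the list itself. The second table in the sample markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Two tables with the same class; the second one is a made-up addition.
html = """
<table class="bp_ergebnis_tab_info"><tr><td> This is a sample text </td><td>x</td></tr></table>
<table class="bp_ergebnis_tab_info"><tr><td> Another table </td><td>y</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table", attrs={"class": "bp_ergebnis_tab_info"})

# Search inside each matching table for its first td.
first_cells = []
for table in tables:
    td = table.find("td")
    if td is not None:
        first_cells.append(td.get_text(strip=True))

print(first_cells)
```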
