Parsing HTML Tables with BeautifulSoup
I have used BeautifulSoup in the past, but I am up against something new: incredibly generic, minimal HTML table markup. My goal is to grab each value and its label (each in its own td) and print them out. They can be merged, I don't care; I just want to make sure each label is applied to the correct value. Here is an example table:
<tbody><tr>
<td class="labels">Dawn:</td>
<td class="site_data" style="text-align: left;">07:01</td>
<td class="labels">Sunrise:</td>
<td class="site_data" style="text-align: left;">07:26</td>
<td class="labels">Moonrise:</td>
<td class="site_data" style="text-align: left;">14:29</td>
<td rowspan="3"><img src="images/moon.bmp" alt="Moon" width="64" align="left" border="0" height="64" style="margin: 0px 10px" /></td>
</tr>
<tr>
<td class="labels">Dusk:</td>
<td class="site_data" style="text-align: left;">18:27</td>
<td class="labels">Sunset: </td>
<td class="site_data" style="text-align: left;">18:02</td>
<td class="labels">Moonset:</td>
<td class="site_data" style="text-align: left;">01:55</td>
</tr>
<tr>
<td class="labels">Daylight:</td>
<td class="site_data" style="text-align: left;">11:26</td>
<td class="labels">Day length:</td>
<td class="site_data" style="text-align: left;">10:36</td>
<td class="labels">Moon Phase:</td>
<td class="site_data" style="text-align: left;">Waxing Gibbous</td>
</tr>
</tbody>
I know how to grab these values...
for td in soup.findAll('table')[0].findAll('td'):  # there's more than one table on the page
    print td.renderContents().strip()
but this only gives me....
'Dawn:'
'07:01'
'Sunrise:'
'07:26'
'Moonrise:'
'14:29'
'<img src="images/moon.bmp" alt="Moon" width="64" align="left" border="0" height="64" style="margin: 0px 10px" />'
'Dusk:'
'18:27'
'Sunset: '
'18:02'
'Moonset:'
'01:55'
'Daylight:'
'11:26'
'Day length:'
'10:36'
'Moon Phase:'
'Waxing Gibbous'
I guess I could grab onto those class values, "labels" and "site_data", but how do I make sure the labels and data are grouped correctly?
The following should be simpler and easier to follow:
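import pprint
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(docTxt)  # docTxt holds the page's HTML source
groupedData = []
for row in soup.findAll("tr"):
    data = {}
    allTDs = row.findAll("td")
    # Step through the cells two at a time: a label, then its value.
    # Stopping at len(allTDs)-1 skips a trailing unpaired cell (the image).
    for x in range(0, len(allTDs)-1, 2):
        data[allTDs[x].renderContents().strip()] = allTDs[x+1].renderContents().strip()
    groupedData.append(data)
pprint.pprint(groupedData)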
output:
[{'Dawn:': '07:01', 'Moonrise:': '14:29', 'Sunrise:': '07:26'},
 {'Dusk:': '18:27', 'Moonset:': '01:55', 'Sunset: ': '18:02'},
 {'Day length:': '10:36',
  'Daylight:': '11:26',
  'Moon Phase:': 'Waxing Gibbous'}]
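If you are on Python 3, the same pairwise idea can be sketched with the newer bs4 package (installed as beautifulsoup4). This is a rough port rather than the code above: the trimmed `html_doc` sample and the `pair_cells` helper name are assumptions made up for illustration, and `get_text(strip=True)` is used in place of `renderContents().strip()`, so the trailing space in 'Sunset: ' is dropped.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Trimmed sample markup (an assumption); the real page carries more attributes.
html_doc = """
<table><tbody>
<tr>
<td class="labels">Dawn:</td><td class="site_data">07:01</td>
<td class="labels">Sunrise:</td><td class="site_data">07:26</td>
<td rowspan="3"><img src="images/moon.bmp" alt="Moon" /></td>
</tr>
<tr>
<td class="labels">Dusk:</td><td class="site_data">18:27</td>
<td class="labels">Sunset: </td><td class="site_data">18:02</td>
</tr>
</tbody></table>
"""


def pair_cells(row):
    """Pair each even-indexed cell (label) with the cell right after it (value)."""
    cells = row.find_all("td")
    # zip stops at the shorter slice, so a trailing unpaired cell is ignored.
    return {
        label.get_text(strip=True): value.get_text(strip=True)
        for label, value in zip(cells[0::2], cells[1::2])
    }


soup = BeautifulSoup(html_doc, "html.parser")
grouped_data = [pair_cells(row) for row in soup.find_all("tr")]
print(grouped_data)
```

Like the original, this relies on cells strictly alternating label, value. If a row ever starts with a non-label cell, every pair in that row shifts by one.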
I'm not a BeautifulSoup expert, but you could try something like this:
for label in soup.findAll('table')[0].findAll('td', attrs={'class' : 'labels'}):
    data_sibs = label.findNextSiblings(attrs={'class' : 'site_data'})
    if len(data_sibs) > 0:
        print label.renderContents().strip() + " " + data_sibs[0].renderContents().strip()
Edit:
Tested and produces the following:
Dawn: 07:01
Sunrise: 07:26
Moonrise: 14:29
etc..
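For anyone on Python 3 with bs4, a minimal sketch of the same class-based lookup might look like this. The trimmed `html_doc` sample and the `readings` dict are assumptions for illustration, and stripping the trailing colon from each label is my own choice, not something the question asked for.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Trimmed sample markup (an assumption); the real page carries more attributes.
html_doc = """
<table><tbody>
<tr>
<td class="labels">Dawn:</td><td class="site_data">07:01</td>
<td class="labels">Moonrise:</td><td class="site_data">14:29</td>
<td rowspan="3"><img src="images/moon.bmp" alt="Moon" /></td>
</tr>
<tr>
<td class="labels">Sunset: </td><td class="site_data">18:02</td>
<td class="labels">Moon Phase:</td><td class="site_data">Waxing Gibbous</td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
table = soup.find_all("table")[0]  # first table on the page

readings = {}
for label in table.find_all("td", class_="labels"):
    # First following sibling cell in the same row that has the data class.
    value = label.find_next_sibling("td", class_="site_data")
    if value is not None:
        key = label.get_text(strip=True).rstrip(":")
        readings[key] = value.get_text(strip=True)

for key, value in readings.items():
    print(f"{key}: {value}")
```

One caveat: `find_next_sibling` returns the first matching cell later in the row, not necessarily the adjacent one, so a label whose value cell is missing would pick up the next label's value.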