BeautifulSoup (Python) and parsing an HTML table
##### Update ##### : renderContents() instead of contents[0] did the trick. I will still leave this open in case someone can provide a better, more elegant solution!
I am trying to parse a number of Web pages for the desired data. The table doesn't have a class/ID tag, so I have to search for 'Website' in the tr contents.
Problem at hand : displaying td.contents works fine for plain text but not for hyperlinks, for some reason. What am I doing wrong? Is there a better way of doing this with BeautifulSoup in Python?
For those suggesting lxml: I have an ongoing thread here; installing lxml on CentOS without admin privileges is proving to be a handful at this time. Hence I am exploring the BeautifulSoup option.
HTML Sample :
<table border="2" width="100%">
<tbody><tr>
<td width="33%" class="BoldTD">Website</td>
<td width="33%" class="BoldTD">Last Visited</td>
<td width="34%" class="BoldTD">Last Loaded</td>
</tr>
<tr>
<td width="33%">
<a href="http://google.com"></a>
</td>
<td width="33%">01/14/2011
</td>
<td width="34%">
</td>
</tr>
<tr>
<td width="33%">
stackoverflow.com
</td>
<td width="33%">01/10/2011
</td>
<td width="34%">
</td>
</tr>
<tr>
<td width="33%">
<a href="http://stackoverflow.com"></a>
</td>
<td width="33%">01/10/2011
</td>
<td width="34%">
</td>
</tr>
</tbody></table>
Python code so far :
f1 = open(PATH + "/" + FILE)
pageSource = f1.read()
f1.close()
soup = BeautifulSoup(pageSource)
alltables = soup.findAll("table", {"border": "2", "width": "100%"})
print "Number of tables found : ", len(alltables)
for table in alltables:
    rows = table.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            print td.contents[0]
from BeautifulSoup import BeautifulSoup
pageSource='''...omitted for brevity...'''
soup = BeautifulSoup(pageSource)
alltables = soup.findAll("table", {"border": "2", "width": "100%"})
results = []
for table in alltables:
    rows = table.findAll('tr')
    lines = []
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            text = td.renderContents().strip('\n')
            lines.append(text)
    text_table = '\n'.join(lines)
    if 'Website' in text_table:
        results.append(text_table)
print "Number of tables found : ", len(results)
for result in results:
    print(result)
yields
Number of tables found : 1
Website
Last Visited
Last Loaded
<a href="http://google.com"></a>
01/14/2011
stackoverflow.com
01/10/2011
<a href="http://stackoverflow.com"></a>
01/10/2011
Is this close to what you are looking for?
The problem was that td.contents returns a list of NavigableStrings and soup Tags. For instance, running print(td.contents) might yield
['', '<a href="http://stackoverflow.com"></a>', '']
so picking off only the first element of the list makes you miss the <a> tag.
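To make that concrete, here is a minimal sketch of walking td.contents and handling both kinds of children — plain text and anchor tags. It assumes the modern bs4 package (the same idea applies in BeautifulSoup 3, where the imports come from the BeautifulSoup module instead):

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# A cell like the ones in the question: the anchor has no text,
# so contents[0] alone would miss the link entirely.
html = '<td width="33%"><a href="http://stackoverflow.com"></a></td>'
soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td')

cells = []
for child in td.contents:
    if isinstance(child, Tag) and child.name == 'a':
        # For anchor tags, pull the link target rather than the (empty) text.
        cells.append(child.get('href'))
    elif isinstance(child, NavigableString):
        text = child.strip()
        if text:
            cells.append(text)

print(cells)  # -> ['http://stackoverflow.com']
```

This keeps the href that contents[0] would have dropped, without resorting to rendering the raw HTML.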
I answered a similar question here. Hope it helps you.
A layman's solution:
alltables = soup.findAll("table", {"border": "2", "width": "100%"})
t = soup.findAll('td')
[x.renderContents().strip('\n') for x in t]
Output:
['Website',
'Last Visited',
'Last Loaded',
'<a href="http://google.com"></a>',
'01/14/2011\n ',
'',
' stackoverflow.com\n ',
'01/10/2011\n ',
'',
'<a href="http://stackoverflow.com"></a>',
'01/10/2011\n ',
'']
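Since installing third-party parsers is a problem here (as with lxml on CentOS without admin rights), the same td extraction can also be sketched with only the standard library's html.parser (Python 3 module name; it is HTMLParser in Python 2). The TdExtractor class name is just an illustrative choice:

```python
from html.parser import HTMLParser

class TdExtractor(HTMLParser):
    """Collects the text or link target of each <td>, in document order."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
            self.cells.append('')
        elif tag == 'a' and self.in_td:
            # Record the href instead of the (empty) anchor text.
            self.cells[-1] += dict(attrs).get('href', '')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data.strip()

parser = TdExtractor()
parser.feed('<tr><td><a href="http://google.com"></a></td>'
            '<td>01/14/2011</td><td></td></tr>')
print(parser.cells)  # -> ['http://google.com', '01/14/2011', '']
```

It is more verbose than BeautifulSoup, but it needs no installation at all, which may matter on a locked-down box.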