开发者

BeautifulSoup (Python) and parsing HTML table

##### Update ###### : renderContents() instead of contents[0] did the trick. I will still leave it open if someone can provide a better, elegant solution!

I am trying to parse a number of Web pages for the desired data. The table doesn't 开发者_StackOverflow社区have a class/ID tag. So I have to search for 'website' in tr contents.

Problem at hand : Displaying td.contents works fine with just text but not hyperlinks for some reason? What am I doing wrong? Is there a better way of doing this using bs in Python?

Those suggesting lxml, I have an ongoing thread here centOS and lxml installation without admin privileges is proving to be a handful at this time. Hence exploring the BeautifulSoup option.

HTML Sample :

                   <table border="2" width="100%">
                      <tbody><tr>
                        <td width="33%" class="BoldTD">Website</td>
                        <td width="33%" class="BoldTD">Last Visited</td>
                        <td width="34%" class="BoldTD">Last Loaded</td>
                      </tr>
                      <tr>
                        <td width="33%">
                          <a href="http://google.com"></a>
                        </td>
                        <td width="33%">01/14/2011
                                </td>
                        <td width="34%">
                                </td>
                      </tr>
                      <tr>
                        <td width="33%">
                          stackoverflow.com
                        </td>
                        <td width="33%">01/10/2011
                                </td>
                        <td width="34%">
                                </td>
                      </tr>
                      <tr>
                        <td width="33%">
                          <a href="http://stackoverflow.com"></a>
                        </td>
                        <td width="33%">01/10/2011
                                </td>
                        <td width="34%">
                                </td>
                      </tr>
                    </tbody></table>

Python code so far :

        f1 = open(PATH + "/" + FILE)
        pageSource = f1.read()
        f1.close()
        soup = BeautifulSoup(pageSource)
        alltables = soup.findAll( "table", {"border":"2", "width":"100%"} )
        print "Number of tables found : " , len(alltables)

        for table in alltables:
            rows = table.findAll('tr')
            for tr in rows:
                cols = tr.findAll('td')
                for td in cols:
                    print td.contents[0]


from BeautifulSoup import BeautifulSoup

pageSource='''...omitted for brevity...'''    

soup = BeautifulSoup(pageSource)
alltables = soup.findAll( "table", {"border":"2", "width":"100%"} )

results=[]
for table in alltables:
    rows = table.findAll('tr')
    lines=[]
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            text=td.renderContents().strip('\n')
            lines.append(text)
    text_table='\n'.join(lines)
    if 'Website' in text_table:
        results.append(text_table) 
print "Number of tables found : " , len(results)
for result in results:
    print(result)

yields

Number of tables found :  1
Website
Last Visited
Last Loaded
<a href="http://google.com"></a>
01/14/2011

stackoverflow.com
01/10/2011

<a href="http://stackoverflow.com"></a>
01/10/2011

Is this close to what you are looking for? The problem was that td.contents returns a list of NavigableStrings and soup tags. For instance, running print(td.contents) might yield

['', '<a href="http://stackoverflow.com"></a>', '']

So picking off the first element of the list makes you miss the <a>-tag.


I answered a similar question here . Hope it will help you.

A lay man solution:

alltables = soup.findAll( "table", {"border":"2", "width":"100%"} )

t = [x for x in soup.findAll('td')]

[x.renderContents().strip('\n') for x in t]

Output:

['Website',
 'Last Visited',
 'Last Loaded',
 '<a href="http://google.com"></a>',
 '01/14/2011\n                                ',
 '',
 '                          stackoverflow.com\n                        ',
 '01/10/2011\n                                ',
 '',
 '<a href="http://stackoverflow.com"></a>',
 '01/10/2011\n                                ',
 '']
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜