Remove <br> tags from a parsed Beautiful Soup list?

I'm currently getting into a for loop with all the rows I want:

page = urllib2.urlopen(pageurl)
soup = BeautifulSoup(page)
tables = soup.find("td", "bodyTd")
for row in tables.findAll('tr'):

At this point, I have my information, but the <br /> tags are ruining my output.

What's the cleanest way to remove these?


for e in soup.findAll('br'):
    e.extract()
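
For anyone on the newer bs4 package, here is a minimal sketch of the same idea. It assumes bs4 is installed, and the sample markup is made up for illustration. find_all() is the bs4 spelling of findAll(), and decompose() removes the tag and throws it away, while extract() removes it and hands it back.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
html = '<table><tr><td class="bodyTd">first<br />second<br/>third</td></tr></table>'

soup = BeautifulSoup(html, "html.parser")
for br in soup.find_all("br"):
    br.decompose()  # remove the tag from the tree and discard it

cell = soup.find("td", "bodyTd")
cell_text = cell.get_text()       # the pieces are now glued together
spaced_text = cell.get_text(" ")  # join the remaining text pieces with a space
print(cell_text)
print(spaced_text)
```

Note that simply dropping the tags runs the neighbouring text together, so passing a separator to get_text() is probably what you want if the breaks were separating values.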


If you want to translate the <br />'s to newlines, do something like this:

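def text_with_newlines(elem):
    """Return the text inside elem, with each <br /> turned into a newline."""
    text = ''
    for e in elem.recursiveChildGenerator():
        if isinstance(e, basestring):  # text node (Python 2 string type)
            text += e.strip()
        elif e.name == 'br':
            text += '\n'
    return text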
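
A rough bs4 / Python 3 take on the same function, as a sketch rather than a drop-in replacement: it assumes bs4 is installed, the sample string is invented, and unlike the version above it does not strip whitespace from each text piece. It swaps every br tag for a newline string with replace_with() and then lets get_text() collect the text.

```python
from bs4 import BeautifulSoup

def text_with_newlines(elem):
    """Return elem's text with each <br> turned into a newline."""
    for br in elem.find_all("br"):
        br.replace_with("\n")  # note: this modifies the parsed tree in place
    return elem.get_text()

# Hypothetical sample markup for illustration.
html = "<td>Some data<br />More data<br />Last line</td>"
soup = BeautifulSoup(html, "html.parser")
result = text_with_newlines(soup.td)
print(result)
```

Because the tree is changed in place, call this on a copy (or re-parse) if you still need the original br tags afterwards.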


Replace the tags with a space before parsing. BeautifulSoup also accepts the string returned by .read() on the urlopen object, so this should work:

import re

page = urllib2.urlopen(pageurl)
page_text = page.read()
# Replace <br>, <br/> and <br /> with a space before parsing
new_text = re.sub(r'<br\s*/?>', ' ', page_text)
soup = BeautifulSoup(new_text)
tables = soup.find("td", "bodyTd")
for row in tables.findAll('tr'):
.....

The re.sub call replaces each br tag with a single space.


Maybe use some_string.replace('<br />','\n') to replace the breaks with newlines:

>>> print 'Some data<br />More data<br />'.replace('<br />','\n')
Some data
More data
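
One caveat: str.replace only catches that exact spelling of the tag. If the page mixes <br>, <br/>, and <BR />, a regular expression is a bit more forgiving. A small sketch using only the standard library follows; the names BR_TAG and breaks_to_newlines and the sample string are made up for illustration.

```python
import re

# Matches <br>, <br/>, <br />, and uppercase variants.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)

def breaks_to_newlines(markup):
    """Replace every <br> variant in a markup string with a newline."""
    return BR_TAG.sub("\n", markup)

# Hypothetical sample with three different spellings of the tag.
sample = "Some data<br>More data<BR/>Last<br />"
converted = breaks_to_newlines(sample)
print(repr(converted))
```

This still treats the markup as plain text, so it will not cope with br tags that carry attributes; for anything messier, parsing first is the safer route.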

You might also want to check out html5lib and lxml, which are both pretty great at parsing HTML. lxml is really fast, and html5lib is designed to be extremely robust.
