Remove <br> tags from a parsed Beautiful Soup list?
I'm currently getting into a for loop with all the rows I want:
page = urllib2.urlopen(pageurl)
soup = BeautifulSoup(page)
tables = soup.find("td", "bodyTd")
for row in tables.findAll('tr'):
At this point, I have my information, but the <br /> tags are ruining my output.
What's the cleanest way to remove these?
# Remove every <br> tag from the parse tree
for e in soup.findAll('br'):
    e.extract()
If you want to translate the <br />s to newlines, do something like this:
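def text_with_newlines(elem):
    # Concatenate the text under elem, turning each <br> into a newline
    text = ''
    for e in elem.recursiveChildGenerator():
        if isinstance(e, basestring):
            # Text node: keep its content without surrounding whitespace
            text += e.strip()
        elif e.name == 'br':
            text += '\n'
    return text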
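For anyone on Python 3 without Beautiful Soup at hand, the same idea (walk the markup, keep the text, emit a newline per br) can be roughly sketched with the standard library's html.parser. This is a swapped-in technique, not the function above: the names BrToNewline and text_with_newlines_stdlib are made up for this example, and it assumes the markup is fed in as one string.

```python
from html.parser import HTMLParser


class BrToNewline(HTMLParser):
    """Hypothetical helper: collect text and emit '\n' for each br tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Called for <br> and, through the default handle_startendtag,
        # for the self-closing form <br /> as well.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Strip each text chunk, mirroring e.strip() in the function above.
        self.parts.append(data.strip())


def text_with_newlines_stdlib(html):
    parser = BrToNewline()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

Calling text_with_newlines_stdlib on the contents of a table cell should give the cell text with one line per br-separated piece.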
Replace the tags with a space up front, before parsing. Beautiful Soup also accepts the string returned by .read() on the urlopen object, so this should work:
import re
import urllib2

page = urllib2.urlopen(pageurl)
page_text = page.read()
# Replace <br>, <br/> and <br /> with a space before parsing
new_text = re.sub(r'<br\s*/?>', ' ', page_text)
soup = BeautifulSoup(new_text)
tables = soup.find("td", "bodyTd")
for row in tables.findAll('tr'):
    .....
The re.sub call replaces each br tag with a single space.
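If the page mixes several spellings of the tag (<br>, <br/>, <br />, or the malformed </br>), a slightly looser pattern may be safer. Here is a minimal sketch, assuming Python 3 and only the standard re module. BR_TAG and replace_br are names made up for this example, and the pattern deliberately leaves alone tags that carry attributes, such as <br class="x">.

```python
import re

# Matches <br>, <br/>, <br />, </br> and upper-case variants.
# Tags with attributes are intentionally not matched.
BR_TAG = re.compile(r"<\s*/?\s*br\s*/?\s*>", re.IGNORECASE)


def replace_br(markup, replacement=" "):
    """Return markup with every bare br tag replaced by `replacement`."""
    return BR_TAG.sub(replacement, markup)
```

Passing "\n" as the replacement gives newlines instead of spaces, and the cleaned string can then be handed to the parser as in the snippet above.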
Maybe use some_string.replace('<br />','\n') to replace the breaks with newlines.
>>> print 'Some data<br />More data<br />'.replace('<br />','\n')
Some data
More data
You might want to check out html5lib and lxml, which are both pretty great at parsing HTML. lxml is really fast, and html5lib is designed to be extremely robust.