Search and Replace in HTML with BeautifulSoup
I want to use BeautfulSoup to search and replace <\a>
with <\a><br>
. I know how to open with urllib2
and then parse to extract all the <a>
tags. What I want to do is search and replace the closing tag with the closing tag plus the break. Any help, much appreciated.
EDIT
I would assume it would be something similar to:
soup.findAll('a').
In the documentation, there is a:
find(text="ahh").replaceWith('Hooray')
So开发者_StackOverflow I would assume it would be along the lines of:
soup.findAll(tag = '</a>').replaceWith(tag = '</a><br>')
But that doesn't work and the python help() doesn't give much
This will insert a <br>
tag after the end of each <a>...</a>
element:
from BeautifulSoup import BeautifulSoup, Tag
# ....
soup = BeautifulSoup(data)
for a in soup.findAll('a'):
a.parent.insert(a.parent.index(a)+1, Tag(soup, 'br'))
You can't use soup.findAll(tag = '</a>')
because BeautifulSoup doesn't operate on the end tags separately - they are considered part of the same element.
If you wanted to put the <a>
elements inside a <p>
element as you ask in a comment, you can use this:
for a in soup.findAll('a'):
p = Tag(soup, 'p') #create a P element
a.replaceWith(p) #Put it where the A element is
p.insert(0, a) #put the A element inside the P (between <p> and </p>)
Again, you don't create the <p>
and </p>
separately because they are part of the same thing.
suppose you have an element which you know contains the "br" markup tags, one way to remove & replace the "br" tags with a different string is like this:
originalSoup = BeautifulSoup("your_html_file.html")
replaceString = ", " # replace each <br/> tag with ", "
# Ex. <p>Hello<br/>World</p> to <p>Hello, World</p>
cleanSoup = BeautifulSoup(str(originalSoup).replace("<br/>", replaceString))
You don't replace an end-tag; in BeautifulSoup you are dealing with a document object model like in a browser, not a string full of HTML. So you couldn't ‘replace’ an end-tag without also replacing the start-tag.
What you want to do is insert a new <br>
element immediately after the <a>...</a>
element. To do so you'll need to find out the index of the <a>
element inside its parent element, and insert the new element just after that index. eg.
soup= BeautifulSoup('<body>blah <a href="foo">blah</a> blah</body>')
for link in soup.findAll('a'):
br= Tag(soup, 'br')
index= link.parent.contents.index(link)
link.parent.insert(index+1, br)
# soup now serialises to '<body>blah <a href="foo">blah</a><br /> blah</body>'
精彩评论