Manipulating string content in an HTML file using BeautifulSoup
Folks,
I am new to Python and BeautifulSoup, so please bear with me. I am trying to do some HTML parsing.
I would like to remove newlines and compact whitespace in selected text strings (chosen by a string search) within an HTML file.
For example, for the following HTML, I would like to find all tags whose text contains "xy" and then remove newlines and multiple spaces from that text (replacing them with a single space).
<html>
<head></head>
<body>
<h1>xy
z</h1>
<p>xy
z</p>
<div align="center" style="margin-left: 0%; ">
<b>
<font style="font-family: 'Times New Roman', Times">
ab c
</font>
<font style="font-family: 'Times New Roman', Times">
xy z
</font>
</b>
</div>
</body>
</html>
The resulting html should look like:
<html>
<head></head>
<body>
<h1>xy z</h1>
<p>xy z</p>
<div align="center" style="margin-left: 0%; ">
<b>
<font style="font-family: 'Times New Roman', Times">
ab c
</font>
<font style="font-family: 'Times New Roman', Times">
xy z
</font>
</b>
</div>
</body>
</html>
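Taken on its own, the whitespace cleanup described above can be sketched on a plain string, before any HTML parsing is involved. This is only a minimal illustration: the helper name `compact_whitespace` is made up for this example and is not part of BeautifulSoup or the original post.

```python
import re

def compact_whitespace(text):
    # Hypothetical helper: replace every run of whitespace
    # (spaces, tabs, newlines) with a single space.
    return re.sub(r"\s+", " ", text)

print(compact_whitespace("xy\nz"))
```

Note that leading and trailing whitespace is collapsed to one space rather than removed; call `.strip()` on the result if that matters.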
OK - I found a way to do it. Use findAll() to collect the text nodes that match, then call replaceWith() on each one, as shown below.
.........
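import re
from bs4 import BeautifulSoup  # assumption: the bs4 package is in use

soup = BeautifulSoup(contents)
s = soup.findAll(text=re.compile("xy"))  # text nodes containing "xy"
for s1 in s:
    # collapse newlines and runs of whitespace into a single space
    s1.replaceWith(re.sub(r'\s+', ' ', str(s1)))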
...........
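For anyone who wants to try this end to end, here is a self-contained sketch of the same approach. It assumes the `bs4` package (Beautiful Soup 4) and uses its current method names, `find_all` and `replace_with`, together with the `string=` argument. The sample HTML is a shortened stand-in for the page in the question, not the original file.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample input (an assumption for this sketch)
html = """<html>
<head></head>
<body>
<h1>xy
z</h1>
<p>xy
   z</p>
<font>
ab   c
</font>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# find_all(string=...) returns the matching text nodes themselves,
# so each one can be swapped for a cleaned-up copy.
for node in soup.find_all(string=re.compile("xy")):
    node.replace_with(re.sub(r"\s+", " ", str(node)))

print(soup)
```

Only text nodes containing "xy" are touched, so the "ab   c" text keeps its original spacing. Whitespace at the start or end of a matched node becomes a single space rather than disappearing.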