
Manipulating string content in an HTML file using BeautifulSoup

Folks

I am new to Python and BeautifulSoup, so please bear with me. I am trying to do some HTML parsing.

I would like to remove newlines and compact whitespace in selected text strings (chosen by a string search) within an HTML file.

For example, for the following HTML, I would like to find all tags whose text contains the string "xy" and then remove newlines and runs of multiple spaces from that text (replacing each run with a single space).

<html>   
    <head></head>   
    <body>
    <h1>xy
        z</h1>
    <p>xy
        z</p>
    <div align="center" style="margin-left: 0%; ">
      <b>
       <font style="font-family: 'Times New Roman', Times">
        ab    c
       </font>
       <font style="font-family: 'Times New Roman', Times">
        xy    z
       </font>
      </b>
     </div>  
    </body> 
</html>

The resulting html should look like:

<html>   
  <head></head>   
  <body>
    <h1>xy z</h1>
    <p>xy z</p>
    <div align="center" style="margin-left: 0%; ">
      <b>
       <font style="font-family: 'Times New Roman', Times">
        ab    c
       </font>
       <font style="font-family: 'Times New Roman', Times">
        xy z
       </font>
      </b>
     </div>   
  </body> 
</html>
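
In the desired output above, the second font element keeps the line breaks before and after its text, and only the run of spaces inside "xy    z" is compacted. A minimal sketch of that behaviour, using only the standard re module, might look like the following. The function name compact_inner_whitespace is made up for illustration and is not part of BeautifulSoup.

```python
import re

def compact_inner_whitespace(text):
    """Collapse interior whitespace runs to one space.

    Leading and trailing whitespace are kept as-is, so the
    indentation around an element's text is preserved.
    (Hypothetical helper, for illustration only.)
    """
    # Split into leading whitespace, core text, trailing whitespace.
    match = re.match(r"\A(\s*)(.*?)(\s*)\Z", text, flags=re.DOTALL)
    leading, core, trailing = match.groups()
    # Newlines and repeated spaces inside the core become a single space.
    return leading + re.sub(r"\s+", " ", core) + trailing
```

Calling it on the h1 text "xy\n        z" gives "xy z", while text that is surrounded by line breaks keeps them.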


OK, I found a way to do it: use findAll() to collect the text nodes that match, then call replaceWith() on each one, as shown below.

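.........
soup = BeautifulSoup(contents)
# Find every text node that contains "xy"
s = soup.findAll(text=re.compile("xy"))
for s1 in s:
    # Collapse newlines and repeated spaces into a single space
    s1.replaceWith(re.sub(r'\s+', ' ', str(s1)))
...........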
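
The same idea can be sketched as a self-contained script for the current bs4 package, where the methods are spelled find_all() and replace_with() and the text filter is passed as string=. This is only a sketch under some assumptions: the function name compact_matching_text, the needle parameter, and the choice of the built-in "html.parser" are mine, not from the post. Note that this version also reduces leading and trailing whitespace in a matched node to a single space, so the font element ends up as " xy z " rather than keeping its line breaks.

```python
import re
from bs4 import BeautifulSoup

def compact_matching_text(html, needle="xy"):
    """Compact whitespace in every text node that contains `needle`.

    Hypothetical helper for illustration; the parser choice is an assumption.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so replacing nodes while looping is safe.
    for node in soup.find_all(string=re.compile(re.escape(needle))):
        # Replace the text node with a whitespace-compacted copy.
        node.replace_with(re.sub(r"\s+", " ", str(node)))
    return str(soup)

if __name__ == "__main__":
    sample = "<h1>xy\n    z</h1><p>ab    c</p>"
    print(compact_matching_text(sample))
```

Text nodes that do not contain the search string, such as "ab    c", are left untouched.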
