Getting html stripped of script and style tags with BeautifulSoup?

I have a simple script where I am fetching an HTML page, passing it to BeautifulSoup to remove all script and style tags, then I want to pass the HTML result to another method. Is there an easy way to do this? Skimming the BeautifulSoup.py, I haven't seen it yet.

soup = BeautifulSoup(html)
for script in soup("script"):
    soup.script.extract()

for style in soup("style"):
    soup.style.extract()
contents = soup.html.contents
text = loader.extract_text(contents)

contents = soup.html.contents just gets a list and everything is defined in classes there. Is there a method that just returns the raw html after soup manipulates it? Or do I just need to go through the contents list and piece the html back together excluding the script & style tags?

Or is there an even better solution to accomplish what I want?


unicode(soup) gives you the HTML as a string. (On Python 3 with the bs4 package, use str(soup) instead.)

Also what you want is this:

for elem in soup.findAll(['script', 'style']):
    elem.extract()
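
For readers on the newer bs4 package, the same idea can be sketched roughly as follows. This is a minimal sketch and not the original poster's code: it assumes bs4 is installed, uses the built-in "html.parser" backend, and the helper name strip_scripts_and_styles and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def strip_scripts_and_styles(html):
    """Return the HTML as a string with all <script> and <style> tags removed.

    Hypothetical helper for illustration; assumes the bs4 package.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names, so both kinds are matched in one pass.
    for elem in soup.find_all(["script", "style"]):
        # decompose() removes the tag and its contents from the tree.
        elem.decompose()
    # str(soup) serializes the modified tree back to HTML.
    return str(soup)


# Made-up sample page for demonstration.
sample = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><p>Hi</p><script>alert('x');</script></body></html>"
)
cleaned = strip_scripts_and_styles(sample)
print(cleaned)
```

decompose() is used here because the removed tags are not needed afterwards; extract() also works and returns the removed tag if you want to keep it.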