Getting HTML stripped of script and style tags with BeautifulSoup?
I have a simple script that fetches an HTML page and passes it to BeautifulSoup to remove all script and style tags. I then want to pass the resulting HTML to another method. Is there an easy way to do this? Skimming BeautifulSoup.py, I haven't found one yet.
soup = BeautifulSoup(html)
for script in soup("script"):
    soup.script.extract()
for style in soup("style"):
    soup.style.extract()
contents = soup.html.contents
text = loader.extract_text(contents)
contents = soup.html.contents just gives me a list, and every item in it is an instance of a BeautifulSoup class. Is there a method that simply returns the raw HTML after soup has manipulated it? Or do I need to walk the contents list and piece the HTML back together, excluding the script & style tags?
Or is there an even better solution to accomplish what I want?
unicode(soup)
gives you the HTML as a string.
Also, what you want is this:
for elem in soup.findAll(['script', 'style']):
    elem.extract()
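
For anyone on Python 3 with the newer bs4 package (an assumption, since the question uses the older BeautifulSoup 3 module), the same idea might look roughly like the sketch below. There, findAll is spelled find_all and str(soup) takes the place of unicode(soup). The function name strip_scripts_and_styles and the sample string are made up for illustration, and "html.parser" is just the standard-library parser picked so nothing else needs installing.

```python
from bs4 import BeautifulSoup


def strip_scripts_and_styles(html):
    """Return the HTML as a string with all <script> and <style> tags removed.

    Hypothetical helper name, not part of BeautifulSoup itself.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names, so one loop covers both tag types.
    for elem in soup.find_all(["script", "style"]):
        elem.extract()  # detach the tag (and its contents) from the tree
    # str(soup) serializes the modified tree back to markup.
    return str(soup)


# Made-up sample input for demonstration.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello</p><script>alert(1)</script></body></html>"
)
cleaned = strip_scripts_and_styles(sample)
print(cleaned)
```

The returned string can then be handed straight to whatever method needs the HTML, instead of the soup.html.contents list.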