Remove HTML tags in AppEngine Python Env (equivalent to Ruby's Sanitize)
I am looking for a python module that will help me get rid of HTML tags but keep the text values. I 开发者_Python百科tried BeautifulSoup before and I couldn't figure out how to do this simple task. I tried searching for Python modules that could do this but they all seem to be dependent on other libraries which does not work well on AppEngine.
Below is a sample code from Ruby's sanitize library and that's what I am after in Python:
require 'rubygems'
require 'sanitize'
html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
Sanitize.clean(html) # => 'foo'
Thanks for your suggestions.
-e
>>> import BeautifulSoup
>>> html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
>>> bs = BeautifulSoup.BeautifulSoup(html)
>>> bs.findAll(text=True)
[u'foo']
This gives you a list of (Unicode) strings. If you want to turn it into a single string, use ''.join(thatlist)
.
If you don't want to use separate libs then you can import standard django utils. For example:
from django.utils.html import strip_tags
html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg'
stripped = strip_tags(html)
print stripped
# you got: foo
Also its already included in Django templates, so you dont need anything else, just use filter, like this:
{{ unsafehtml|striptags }}
Btw, this is one of the fastest way.
Using lxml:
htmlstring = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
from lxml.html import fromstring
mySearchTree = fromstring(htmlstring)
for item in mySearchTree.cssselect('a'):
print item.text
#!/usr/bin/python
from xml.dom.minidom import parseString
def getText(el):
ret = ''
for child in el.childNodes:
if child.nodeType == 3:
ret += child.nodeValue
else:
ret += getText(child)
return ret
html = '<b>this is <a href="http://foo.com/">a link </a> and some bold text </b> followed by <img src="http://foo.com/bar.jpg" /> an image'
dom = parseString('<root>' + html + '</root>')
print getText(dom.documentElement)
Prints:
this is a link and some bold text followed by an image
Late, but.
You can use Jinja2.Markup()
http://jinja.pocoo.org/docs/api/#jinja2.Markup.striptags
from jinja2 import Markup
Markup("<div>About</div>").striptags()
u'About'
精彩评论