Remove a bad tag completely with html5lib.sanitizer
I'm trying to use html5lib.sanitizer to clean user input, as suggested in the docs.
The problem is I want to remove bad tags completely and not just escape them (which seems like a bad idea anyway).
The workaround suggested in the patch here doesn't work as expected (it keeps inner content of a <tag>content</tag>).
Specifically, I want to do something like this:
Input:
<script>bad_thing();</script>
<style>* { background: #000; }</style>
<h1>Hello world</h1>
Lorem ipsum
Output:
<h1>Hello world</h1>
Lorem ipsum
Any ideas on how to achieve it? I've tried BeautifulSoup, but it doesn't seem to work well, and lxml inserts <p></p> tags in very strange places (e.g. around src attrs). So far, html5lib seems to be the best thing for the purpose, if I could just get it to remove tags instead of escaping them.
The challenge is to also strip unwanted nested tags. It isn't pretty but it's a step in the right direction:
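from lxml.html import fromstring
from lxml import etree

html = '''
<script>bad_thing();</script>
<style>* { background: #000; }</style>
<h1>Hello world<script>bad_thing();</script></h1>
Lorem ipsum
<script>bad_thing();</script>
<b>Bold Text</b>
'''

parts = []
doc = fromstring(html)
# Select only the elements we want to keep; everything else is ignored.
for el in doc.xpath(".//h1|.//b"):
    # Rebuild each element from its text and tail only, which drops nested tags.
    i = etree.Element(el.tag)
    i.text, i.tail = el.text, el.tail
    # encoding='unicode' makes tostring() return str instead of bytes.
    parts.append(etree.tostring(i, encoding='unicode'))
print(''.join(parts))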
Which outputs:
<h1>Hello world</h1>
Lorem ipsum
<b>Bold Text</b>
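If pulling in lxml is not an option, the same idea (drop some tags together with their content, keep a small whitelist) can be sketched with only the standard library's html.parser. This is a swapped-in technique, not html5lib's sanitizer, and the names DROP_WITH_CONTENT, ALLOWED, TagStripper and strip_tags are made up for this example. It assumes that attributes can be thrown away and that tags outside both sets should lose their markup but keep their text. Treat it as a rough starting point rather than a vetted sanitizer:

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy for this sketch (not taken from html5lib or lxml).
DROP_WITH_CONTENT = {"script", "style"}  # removed together with their content
ALLOWED = {"h1", "b"}                    # the only tags written to the output


class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED:
            # Attributes are discarded on purpose in this sketch.
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
        elif self.skip_depth == 0 and tag in ALLOWED:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is kept (re-escaped) unless we are inside a dropped element.
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def strip_tags(markup):
    parser = TagStripper()
    parser.feed(markup)
    parser.close()  # flushes any text still buffered by the parser
    return "".join(parser.out)
```

Calling strip_tags(html) on the sample above keeps the <h1> and <b> elements and the Lorem ipsum text, and leaves blank lines where the script and style elements used to be. An unclosed <script> makes everything after it disappear, which seems like the safer way to fail.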