Remove a bad tag completely with html5lib.sanitizer

2023-03-06 07:49

I'm trying to use html5lib.sanitizer to clean user input, as suggested in the docs.

The problem is I want to remove bad tags completely and not just escape them (which seems like a bad idea anyway).

The workaround suggested in the patch here doesn't work as expected (it keeps the inner content of <tag>content</tag>).

Specifically, I want to do something like this:

Input:

<script>bad_thing();</script>
<style>* { background: #000; }</style>
<h1>Hello world</h1>
Lorem ipsum

Output:

<h1>Hello world</h1>
Lorem ipsum

Any ideas on how to achieve it? I've tried BeautifulSoup, but it doesn't seem to work well, and lxml inserts <p></p> tags in very strange places (e.g. around src attrs). So far, html5lib seems to be the best thing for the purpose, if I could just get it to remove tags instead of escaping them.

The challenge is to also strip unwanted nested tags. The following isn't pretty, but it's a step in the right direction:

from lxml.html import fromstring
from lxml import etree

html = '''
<script>bad_thing();</script>
<style>* { background: #000; }</style>
<h1>Hello world<script>bad_thing();</script></h1>
Lorem ipsum
<script>bad_thing();</script>
<b>Bold Text</b>
'''

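parts = []
doc = fromstring(html)
# Keep only the wanted tags (h1 and b), wherever they are nested.
for el in doc.xpath(".//h1|.//b"):
    # Rebuild each element from scratch so nested children such as <script> are dropped.
    clean = etree.Element(el.tag)
    clean.text, clean.tail = el.text, el.tail
    # encoding='unicode' makes tostring() return str rather than bytes on Python 3.
    parts.append(etree.tostring(clean, encoding='unicode'))

print(''.join(parts))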

Which outputs:

<h1>Hello world</h1>
Lorem ipsum
<b>Bold Text</b>
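
For comparison, here is a minimal sketch of the opposite approach: instead of rebuilding the wanted tags, drop the unwanted ones together with their content and pass everything else through. It swaps in the standard library's html.parser rather than html5lib or lxml, and the names DROP_TAGS, TagDropper and drop_tags are hypothetical, made up for this example. It only illustrates the "remove instead of escape" idea; it is not a whitelist sanitizer, so attributes such as onclick on the kept tags pass through untouched.

```python
from html.parser import HTMLParser

# Hypothetical choice for this sketch: tags removed together with their content.
DROP_TAGS = {"script", "style"}


class TagDropper(HTMLParser):
    """Re-emit the input, skipping DROP_TAGS elements and everything inside them.

    Comments and doctype declarations are not re-emitted either, because the
    corresponding handlers are left at their do-nothing defaults.
    """

    def __init__(self):
        # Keep entity and character references as written in the input.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.depth = 0  # greater than 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.depth += 1
        elif self.depth == 0:
            # Raw text of the start tag, attributes included.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, as written.
        if tag not in DROP_TAGS and self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append("&#%s;" % name)


def drop_tags(html):
    """Return html with every DROP_TAGS element and its content removed."""
    parser = TagDropper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    sample = '''
<script>bad_thing();</script>
<style>* { background: #000; }</style>
<h1>Hello world<script>bad_thing();</script></h1>
Lorem ipsum
<b>Bold Text</b>
'''
    print(drop_tags(sample))
```

Unlike the lxml snippet above, this keeps text and markup outside the dropped elements exactly as written, including attributes, so it would need a tag and attribute whitelist on top before being used on untrusted input.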
