
Java: Best way to remove JavaScript from HTML

What's the best library or approach for removing JavaScript from HTML before it is displayed?

For example, take:

<html><body><span onmousemove='doBadXss()'>test</span></body></html>

and leave:

<html><body><span>test</span></body></html>

I have seen the DeXSS project, but is that the best way to go?


JSoup has a simple method for sanitizing HTML based on a whitelist. See http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer

It uses a whitelist, which is safer than the blacklist approach DeXSS uses. From the DeXSS page:

There are still a number of known XSS attacks that DeXSS does not yet detect.

A blacklist only disallows known unsafe constructions, while a whitelist only allows known safe constructions. So only a whitelist protects you against unknown, possibly unsafe constructions.
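To make the difference concrete, here is a minimal sketch in plain Java. It is not the JSoup or DeXSS API; the class name and both attribute lists are made up for illustration. It only shows how each policy treats an attribute that nobody thought of in advance.

```java
import java.util.Locale;
import java.util.Set;

class AttributePolicyDemo {
    // Hypothetical blacklist: only the handlers its author happened to know about.
    static final Set<String> BLACKLISTED = Set.of("onclick", "onmouseover", "onmousemove");

    // Hypothetical whitelist: the only attributes considered safe.
    static final Set<String> WHITELISTED = Set.of("title", "lang");

    // Blacklist policy: everything passes unless it is explicitly listed.
    static boolean blacklistAllows(String attribute) {
        return !BLACKLISTED.contains(attribute.toLowerCase(Locale.ROOT));
    }

    // Whitelist policy: nothing passes unless it is explicitly listed.
    static boolean whitelistAllows(String attribute) {
        return WHITELISTED.contains(attribute.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String unknown = "onpointerenter"; // a real event handler missing from the blacklist
        System.out.println(blacklistAllows(unknown)); // true: it slips through
        System.out.println(whitelistAllows(unknown)); // false: rejected by default
    }
}
```

The blacklist has to be updated every time a new attack vector shows up; the whitelist fails closed.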


The easiest way would be to not have those constructs in the first place. It probably makes sense to allow only very simple tags in free-form fields and to disallow attributes of any kind.

Probably not the answer you're going for, but in many cases you only want to provide markup capabilities, not a full editing suite.
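A rough sketch of that idea, using only the XML classes that ship with the JDK. The class name, the tag list, and the decision to drop disallowed elements together with their contents are all my assumptions, and the input must be a well-formed XHTML fragment (real-world HTML would need tidying first).

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class SimpleTagFilter {
    // Assumed list of "very simple" tags; anything else is dropped.
    static final Set<String> ALLOWED_TAGS = Set.of("b", "i", "em", "strong", "p");

    /** Cleans a well-formed XHTML fragment; throws IllegalArgumentException otherwise. */
    static String clean(String fragment) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Untrusted input: refuse DOCTYPEs to avoid entity-expansion tricks.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<root>" + fragment + "</root>")))
                    .getDocumentElement();
            filter(root);

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(root), new StreamResult(out));
            String xml = out.toString();
            // Strip the synthetic <root> wrapper; an empty result serializes as <root/>.
            return xml.startsWith("<root>")
                    ? xml.substring("<root>".length(), xml.length() - "</root>".length())
                    : "";
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
    }

    // Keeps allowed elements (minus all attributes) and plain text; removes everything else.
    private static void filter(Element parent) {
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // capture before a possible removal
            if (child instanceof Element) {
                Element element = (Element) child;
                if (ALLOWED_TAGS.contains(element.getTagName().toLowerCase(Locale.ROOT))) {
                    NamedNodeMap attributes = element.getAttributes();
                    while (attributes.getLength() > 0) {
                        element.removeAttributeNode((Attr) attributes.item(0));
                    }
                    filter(element);
                } else {
                    parent.removeChild(element); // disallowed tag: drop it and its contents
                }
            } else if (child.getNodeType() != Node.TEXT_NODE) {
                parent.removeChild(child); // comments, CDATA sections, processing instructions
            }
            child = next;
        }
    }

    public static void main(String[] args) {
        System.out.println(clean("<p onclick=\"x()\">Hi <b>there</b><script>alert(1)</script></p>"));
    }
}
```

Because nothing outside the tag list survives and no attribute survives at all, there is very little left for an attacker to work with.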


Similarly, an even easier approach would be to provide a text-based syntax, like Markdown, for editing. There are not that many ways to exploit the SO edit area, for instance: it accepts Markdown syntax plus a limited list of tags without attributes.


You could try dom4j (http://dom4j.sourceforge.net/dom4j-1.6.1/). It builds an in-memory document tree (as opposed to streaming events like SAX) and lets you easily traverse and manipulate it, removing attributes such as onmouseover (or entire elements such as <script>) before writing the document back out or streaming it somewhere. Depending on how wild your HTML is, you may need to clean it up first; JTidy (http://jtidy.sourceforge.net/) is good for that.

Obviously, all of this adds some overhead if you do it at page render time.
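The traversal could look roughly like this. To keep the sketch dependency-free it uses the DOM API built into the JDK (javax.xml.parsers and org.w3c.dom) instead of dom4j, so the calls differ from dom4j's, but the idea is the same. The class and method names are made up, and the input is assumed to be well-formed XHTML without a DOCTYPE, for example after tidying.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class EventHandlerStripper {
    /** Removes on* attributes and script elements from a well-formed XHTML document. */
    static String strip(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Untrusted input: refuse DOCTYPEs (this also rejects documents that declare one).
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document document = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            scrub(document.getDocumentElement());

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            // Force XML output; otherwise an <html> root may switch the serializer to HTML mode.
            serializer.setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Walks the tree depth-first, removing event-handler attributes and script elements.
    private static void scrub(Element element) {
        NamedNodeMap attributes = element.getAttributes();
        for (int i = attributes.getLength() - 1; i >= 0; i--) {
            Attr attribute = (Attr) attributes.item(i);
            if (attribute.getName().toLowerCase(Locale.ROOT).startsWith("on")) {
                element.removeAttributeNode(attribute);
            }
        }
        Node child = element.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // capture before a possible removal
            if (child instanceof Element) {
                Element childElement = (Element) child;
                if (childElement.getTagName().equalsIgnoreCase("script")) {
                    element.removeChild(childElement);
                } else {
                    scrub(childElement);
                }
            }
            child = next;
        }
    }

    public static void main(String[] args) {
        System.out.println(strip(
                "<html><body><span onmousemove='doBadXss()'>test</span></body></html>"));
    }
}
```

Keep in mind that this is still a blacklist: it only removes the constructs it knows about and would not catch, say, a javascript: URL in an href.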
