Advantages of HTML Purifier over regex filtering

2023-01-10 19:33

We recently adopted HTML Purifier in our web-based application. Previously we used regexes to match commonly known XSS injections (script, img, and so on). We realized that this wasn't good enough and moved to HTML Purifier. Given that HTML Purifier is slow (very slow compared to the regex method we had earlier), is it really worth using? Or does it make sense to keep extending the regex filtering until we reach a satisfactory level (though it might be argued that the speed benefit would be nullified by that time)? Has anyone else faced similar security issues in a web application, and what did you do in the end?

Please let me know if anything seems vague; I would be happy to provide more details.

Using a regex for HTML/JavaScript? Perhaps you have not seen this epic answer by bobince. In short, if you use a regex, then you have two problems. In fact, the reason HTML Purifier is so slow is that it makes hundreds of calls to preg_match() and preg_replace() in order to clean a message. Never reinvent the wheel; your own filter will without a doubt be less secure.

The real question is htmlspecialchars($var, ENT_QUOTES); vs HTML Purifier. HTML Purifier is not only slow, it has also been hacked many times. Don't use HTML Purifier unless there is no other choice; htmlspecialchars() solves most problems, and when the value is output as HTML text or inside a quoted attribute it solves them in a way that cannot be bypassed.
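As a minimal sketch of the htmlspecialchars() approach, assuming the value is echoed into HTML text or a quoted attribute (the `$comment` variable and its contents are hypothetical, for illustration only):

```php
<?php
// Hypothetical untrusted input, used only for illustration.
$comment = '<script>alert("xss")</script> & \'quotes\'';

// ENT_QUOTES escapes both double and single quotes;
// the character set is passed explicitly rather than relying on the default.
$safe = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

// The markup is now inert text: every <, >, &, " and ' has become an entity.
echo $safe, PHP_EOL;
```

This escapes everything, so it only fits cases where users are not supposed to submit any markup at all. If some HTML must survive, escaping alone is not an option.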

The problem with regexes is that filtering HTML is too complex a task to be able to do easily, or elegantly, with regexes without creating a big mess.

You need to build something that actually understands HTML, can operate on it as HTML, and knows how a browser is going to interpret it. Regexes operate on it as if it were just one big long string. They're not good or elegant at parsing HTML in a stateful manner, for example recognising that the current match is within a comment, within an attribute, or within an element. It's just really complicated to emulate that in regexes.

The other issue is that 'matching commonly known XSS injections' is way more complex than it sounds. If it isn't, you're not doing it right. Your filter needs to know HTML, it needs to know what a valid URL scheme is, how null bytes work in different parts of HTML, and so on. Most of the injections on the XSS cheat sheet, for example, are based on getting around filtering done by regex-based filters.
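To make that concrete, here is a small sketch of how a blacklist regex can be sidestepped. The `naive_filter()` function and both payloads are hypothetical and are not taken from the asker's code:

```php
<?php
// Hypothetical blacklist filter: strip complete <script>...</script> blocks.
function naive_filter(string $html): string
{
    return preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html) ?? '';
}

// Hypothetical payloads, for illustration only.
$nested = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>';
$attribute = '<img src="x" onerror="alert(1)">';

// Removing the inner empty script blocks splices the outer fragments
// back together into a working script element.
echo naive_filter($nested), PHP_EOL;

// There is no script tag here at all, so the filter leaves it untouched.
echo naive_filter($attribute), PHP_EOL;
```

Each such case can be patched with another pattern, but every patch is a reaction to one known payload, not a model of how the browser parses the markup.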

One more thing: HTML Purifier is maintained by someone who knows what they're doing. You can trust it, and you can trust that if a new flaw is found in it, it will be patched. That saves you a lot of work trying to do the same thing on your own and keeping up with every newly discovered bypass.

It's better to be safe than sorry. There's a whole slew of attacks your regular expressions might not find. For example, here are just a few. If HTML Purifier is too slow, see if caching the purified HTML helps.
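One way the caching idea might look, as a rough sketch only: `purify_cached()` is a hypothetical helper, the in-memory array stands in for whatever cache or database column you would really use, and strip_tags() is just a placeholder for the slow call (it is not a safe sanitizer). With HTML Purifier, the callable would wrap the `purify()` method of an `HTMLPurifier` instance.

```php
<?php
// Hypothetical helper: memoise purifier output, keyed by a hash of the raw input.
// $purify is any callable that takes raw HTML and returns cleaned HTML.
function purify_cached(string $html, callable $purify, array &$cache): string
{
    $key = hash('sha256', $html);
    if (!isset($cache[$key])) {
        $cache[$key] = $purify($html); // slow path, runs once per distinct input
    }
    return $cache[$key];
}

// Stand-in for the expensive purifier call; counts how often it actually runs.
$calls = 0;
$purify = function (string $html) use (&$calls): string {
    $calls++;
    return strip_tags($html); // placeholder only, NOT a real sanitizer
};

$cache = [];
$first = purify_cached('<b>hello</b>', $purify, $cache);
$second = purify_cached('<b>hello</b>', $purify, $cache); // served from the cache
```

A common variant is to purify once when the content is saved and store the cleaned HTML next to the raw input, so page views never pay the purification cost.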
