How to remove dangerous characters (i.e. script tags)?

2023-01-01 20:59

I am wondering whether there is any C# class or third-party library that removes dangerous characters such as script tags.

I know you can use regex, but I also know people can write their script tags in so many ways that you can fool the regex into thinking the input is OK.
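As a rough illustration of that concern, here is a minimal sketch of a deliberately naive filter. `NaiveScriptFilter` and `StripScriptTags` are hypothetical names, not part of any library, and the pattern is only one of many a person might try.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveScriptFilter
{
    // Removes <script>...</script> blocks, ignoring case. Single pass only.
    public static string StripScriptTags(string html)
    {
        return Regex.Replace(html, @"<script\b[^>]*>.*?</script\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}

// Nested payload: deleting the inner blocks reassembles an outer one.
// NaiveScriptFilter.StripScriptTags("<scr<script></script>ipt>alert(1)</scr<script></script>ipt>")
//   returns "<script>alert(1)</script>"
//
// No script tag at all, so nothing is removed.
// NaiveScriptFilter.StripScriptTags("<img src=x onerror=alert(1)>")
//   returns "<img src=x onerror=alert(1)>"
```

Both inputs pass through the filter still carrying script, which is why the answers in this thread lean on a parser rather than a pattern.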

I also heard that HTML Agility Pack is good, so I am wondering whether a script removal class has been made for it.

Edit

http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=24346

I found this on their forums. However, I am not sure whether it is a complete solution: the author has no tests to back it up, and it would be nicer if it were on a site where lots of people used the script every day, so we could see whether anything gets past it.

Great example (almost), Thanks! A few ways to make it stronger that I saw, though:

1) Use a case-insensitive search when looking for links with "javascript:", "vbscript:", "jscript:". For example, the original example would not remove the HTML:
<a href="JAVAscRipt:alert('hi')">click me</a>
2) Remove any style attributes that contain an expression rule. Internet Explorer evaluates the CSS expression rule as script. For example, the following would produce a message box:
<div style="width:expression(alert('hi'));">bad code</div>
3) Also remove other potentially harmful tags (the first XPath query in the code below lists them).

I honestly have no idea why "expression" has not been removed from IE; it is a major flaw in my opinion. (Try the div example in Internet Explorer and you'll see why, even in IE8.) I just wish there were an easier, standard way to clean up HTML input from a user.

Here's the code updated with these improvements. Let me know if you see anything wrong:

    // Requires a reference to HtmlAgilityPack and: using HtmlAgilityPack;
    public string ScrubHTML(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        //Remove potentially harmful elements
        HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object|//embed");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.ParentNode.RemoveChild(node, false);

            }
        }

        //remove hrefs to java/j/vbscript URLs
        nc = doc.DocumentNode.SelectNodes("//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
        if (nc != null)
        {

            foreach (HtmlNode node in nc)
            {
                node.SetAttributeValue("href", "#");
            }
        }


        //remove img with refs to java/j/vbscript URLs
        nc = doc.DocumentNode.SelectNodes("//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.SetAttributeValue("src", "#");
            }
        }

        //remove on<Event> handlers from all tags
        //(fixed list, not exhaustive: other on* attributes such as onerror are not covered)
        nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondblclick or @ondoubleclick or @onload or @onunload]");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.Attributes.Remove("onFocus");
                node.Attributes.Remove("onBlur");
                node.Attributes.Remove("onClick");
                node.Attributes.Remove("onMouseOver");
                node.Attributes.Remove("onMouseOut");
                node.Attributes.Remove("onDblClick");   // the actual HTML attribute name
                node.Attributes.Remove("onDoubleClick");
                node.Attributes.Remove("onLoad");
                node.Attributes.Remove("onUnload");
            }
        }

        // remove any style attributes that contain the word expression (IE evaluates this as script)
        nc = doc.DocumentNode.SelectNodes("//*[contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'expression')]");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.Attributes.Remove("stYle");
            }
        }

        return doc.DocumentNode.WriteTo();
    }
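
One gap in the routine above is that it strips only a fixed list of on<Event> attributes and matches script URLs only at the very start of the attribute value. Below is a minimal sketch of a more general pass, assuming HtmlAgilityPack is referenced. `AttributeScrubber`, `IsScriptUrl` and `ScrubAttributes` are hypothetical names of mine, and this is an illustration of the idea rather than a complete sanitizer.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class AttributeScrubber
{
    private static readonly string[] ScriptSchemes = { "javascript:", "jscript:", "vbscript:" };

    // True when the value starts with a script scheme, ignoring case and any
    // whitespace or control characters mixed into it (deliberately over-cautious).
    public static bool IsScriptUrl(string value)
    {
        if (string.IsNullOrEmpty(value)) return false;
        string compact = new string(
            value.Where(c => !char.IsWhiteSpace(c) && !char.IsControl(c)).ToArray());
        return ScriptSchemes.Any(s => compact.StartsWith(s, StringComparison.OrdinalIgnoreCase));
    }

    public static string ScrubAttributes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (HtmlNode node in doc.DocumentNode.Descendants().ToList())
        {
            // Copy the attributes first so removal does not disturb the enumeration.
            foreach (HtmlAttribute attr in node.Attributes.ToList())
            {
                if (attr.Name.StartsWith("on", StringComparison.OrdinalIgnoreCase))
                {
                    // Any on<Event> handler, not just a fixed list.
                    attr.Remove();
                }
                else if ((attr.Name == "href" || attr.Name == "src")
                         && IsScriptUrl(HtmlEntity.DeEntitize(attr.Value)))
                {
                    // Decode entities before checking, then neutralise the URL.
                    attr.Value = "#";
                }
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

Removing every attribute whose name starts with "on" is broader than strictly necessary, but for untrusted input a false positive is cheaper than a missed handler.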

We had the same problem: Users enter HTML and we want to display it inside our XHTML pages. Note that they enter HTML fragments and not complete documents. I did research on this back in 2010 using unit tests to test for many different cases.

Solution:

1) Use the Microsoft Anti-Cross Site Scripting Library to remove everything considered unsafe (mainly scripts). Note that this tool doesn't close these tags: img, hr, br, and sometimes it closes tags in the wrong order.
2) Use Tidy.Net to create almost valid XHTML.
3) Remove the html, head and body tags that Tidy.Net tends to create.
4) Remove the extra line breaks that Tidy.Net creates inside "pre" tags.

This will remove all JS and produce something that is, in most cases, a valid XHTML fragment. It will also remove all style tags.
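
The third step (dropping the html, head and body wrappers) could be sketched roughly as follows, assuming the tidied output is a complete document with a single body element. `TidyCleanup` and `ExtractBodyFragment` are hypothetical names, not part of Tidy.Net, and a regex is used here only because the input has already been normalised by the earlier steps.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TidyCleanup
{
    // Returns the content between <body ...> and the last </body>,
    // or the input unchanged when no body element is found.
    public static string ExtractBodyFragment(string xhtml)
    {
        Match m = Regex.Match(xhtml, @"<body\b[^>]*>(.*)</body\s*>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : xhtml;
    }
}

// TidyCleanup.ExtractBodyFragment(
//     "<html><head><title></title></head><body>\n<p>Hi</p>\n</body></html>")
//   returns "<p>Hi</p>"
```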

The tools I tried have these problems:

Microsoft Anti-Cross Site Scripting Library: Doesn't close these tags: img, hr, br and sometimes it closes tags in the wrong order. Unfortunately not customizable.

Tidy.Net: Creates extra line breaks inside pre tags. (Can be fixed manually after running the tool.)

TidyForNet: Unstable. Sometimes gives you "Assertion failed in blabla.c"

Tidy (C-DLL) COM wrapper made in VB6: Impractical to say the least. You have to register the COM DLL.

HtmlAgilityPack: Inserts extra line breaks occasionally. Removes line breaks from pre tags.

Majestic12 HTML-parser: Doesn't close these tags: img, hr, br and sometimes it closes tags in the wrong order.

AntiSamy.Net: Impractical in that it uses components written in J# which is obsolete. Due to this it cannot run in a 64 bit environment. On the plus side it is very customizable regarding which tags and attribute values to allow.

How about Encoder.HtmlEncode? VS 2010 suggests it when trying to work with AntiXss.HtmlEncode.
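
Note that encoding removes nothing; it escapes markup so the browser displays it as text. Here is a minimal sketch using the framework's own `System.Net.WebUtility.HtmlEncode` as a stand-in. That substitution is my assumption: the AntiXSS `Encoder.HtmlEncode` is a separate library call and may differ in exactly which characters it escapes. `EncodeDemo` is a hypothetical name.

```csharp
using System;
using System.Net;

public static class EncodeDemo
{
    // Escapes HTML-significant characters instead of deleting them.
    public static string Encode(string untrusted)
    {
        return WebUtility.HtmlEncode(untrusted);
    }
}

// EncodeDemo.Encode("<script>alert(1)</script>")
//   returns "&lt;script&gt;alert(1)&lt;/script&gt;"
```

This suits the case where users should not be able to enter any HTML at all. If some formatting tags must survive, a sanitizer is needed instead.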

    string value = "Here <script>alert('hello')</script> we go. Visit the " +
        "<a href='http://west-wind.com'>West Wind</a> site. " +
        "<img src='http://west-wind.com/images/new.gif' /> ";
    string safestring = Microsoft.Security.Application.Sanitizer.GetSafeHtmlFragment(value);

The above code will remove script tags from the string.

I would use built-in methods. As I see it, if a user wants to break your program, they will find a way to do it. But if you combine multiple methods of sanitizing user input, your program will only be more secure.

For instance, with a string variable named "myString", I would combine regex character stripping with plain manual character stripping, just to be safe.

This will remove everything that isn't alphanumeric.

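    // requires: using System.Text.RegularExpressions;
    myString = Regex.Replace(myString, "[^a-z0-9]", "", RegexOptions.IgnoreCase);
    // manual second pass (after the regex above these characters are already gone)
    myString = myString.Replace("/", "");
    myString = myString.Replace("<", "");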

etc.

You could also extend this further by removing text that is between "<" and ">" characters and then between ">" and "<".
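
To make the trade-off concrete, here is a minimal sketch of both ideas, the alphanumeric whitelist and the removal of text between "<" and ">". `StripDemo`, `KeepAlphanumeric` and `StripTags` are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripDemo
{
    // Keeps only ASCII letters and digits; spaces and punctuation go too.
    public static string KeepAlphanumeric(string input)
    {
        return Regex.Replace(input, "[^a-z0-9]", "", RegexOptions.IgnoreCase);
    }

    // Removes anything that looks like a tag: "<", then any run of non-">", then ">".
    public static string StripTags(string input)
    {
        return Regex.Replace(input, "<[^>]*>", "");
    }
}

// StripDemo.KeepAlphanumeric("<b>Hello, World!</b>") returns "bHelloWorldb"
// StripDemo.StripTags("<b>Hello, World!</b>")        returns "Hello, World!"
```

The whitelist is safe but destroys ordinary text, so it fits fields like usernames better than free-form content. The tag-stripping version keeps the text but is a pattern match, with the weaknesses the question already raises.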

I prefer not to use external third-party libraries (unless I have to) because you have to distribute the library as well, you're relying on someone else's program to make yours secure, and if there's a vulnerability in their software, yours is vulnerable too.
