How to remove dangerous characters (i.e. script tags)?

I am wondering whether there is any C# class or 3rd party library that removes dangerous characters such as script tags.

I know you can use regex, but I also know people can write their script tags in so many ways that the regex can be fooled into thinking the input is OK.

I have also heard that HTML Agility Pack is good, so I am wondering whether a script removal class has been written for it.

Edit

http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=24346

I found this on their forums. However, I am not sure it is a complete solution, as the author has no tests to back it up. It would be nicer if it were on a site where lots of people used the script every day, so that anything that gets by would be noticed.

Great example (almost), Thanks! A few ways to make it stronger that I saw, though:

1) Use case-insensitive search when looking for links with "javascript:", "vbscript:", "jscript:". For example, the original example would not remove the HTML:

<a href="JAVAscRipt:alert('hi')">click me</a>

2) Remove any style attributes that contain an expression rule. Internet Explorer evaluates the CSS "expression" rule as script. For example, the following would produce a message box:

<div style="width:expression(alert('hi'));">bad code</div>

3) Also remove other potentially harmful tags (the first SelectNodes call in the code below lists the ones removed).

I honestly have no idea why "expression" has not been removed from IE - major flaw in my opinion. (Try the div example in internet explorer and you'll see why - even IE8.) I just wish there was an easier/standard way to clean-up html input from a user.

Here's the code updated with these improvements. Let me know if you see anything wrong:

    public string ScrubHTML(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        //Remove potentially harmful elements
        HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object|//embed");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.ParentNode.RemoveChild(node, false);

            }
        }

        //remove hrefs to java/j/vbscript URLs
        nc = doc.DocumentNode.SelectNodes("//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
        if (nc != null)
        {

            foreach (HtmlNode node in nc)
            {
                node.SetAttributeValue("href", "#");
            }
        }


        //remove img with refs to java/j/vbscript URLs
        nc = doc.DocumentNode.SelectNodes("//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.SetAttributeValue("src", "#");
            }
        }

        //remove on<Event> handlers from all tags (fixed list only; other on* attributes are not covered)
        nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondblclick or @onload or @onunload]");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.Attributes.Remove("onfocus");
                node.Attributes.Remove("onblur");
                node.Attributes.Remove("onclick");
                node.Attributes.Remove("onmouseover");
                node.Attributes.Remove("onmouseout");
                node.Attributes.Remove("ondblclick"); // the HTML attribute is ondblclick, not ondoubleclick
                node.Attributes.Remove("onload");
                node.Attributes.Remove("onunload");
            }
        }

        // remove any style attributes that contain the word expression (IE evaluates this as script)
        nc = doc.DocumentNode.SelectNodes("//*[contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'expression')]");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.Attributes.Remove("style");
            }
        }

        return doc.DocumentNode.WriteTo();
    } 
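
The case-insensitive scheme check from point 1 can also be done in plain C# on an attribute value, outside XPath. The following is only a minimal sketch under a few assumptions: the class and method names (UrlSchemeCheck, IsScriptUrl) are made up for illustration, the scheme list is the one used in the code above, and the value is assumed to be already entity-decoded. It additionally drops whitespace and control characters before comparing, since browsers may ignore tabs and line breaks inside a URL scheme.

```csharp
using System;
using System.Linq;

public static class UrlSchemeCheck
{
    // Schemes checked by the ScrubHTML code above; extend as needed.
    private static readonly string[] ScriptSchemes = { "javascript", "jscript", "vbscript" };

    // Returns true when the attribute value looks like a script URL.
    // Assumption: the value has already been HTML-entity-decoded.
    public static bool IsScriptUrl(string value)
    {
        if (string.IsNullOrEmpty(value)) return false;

        // Drop whitespace and control characters so "java\tscript:" is still caught.
        string compact = new string(value
            .Where(c => !char.IsWhiteSpace(c) && !char.IsControl(c))
            .ToArray());

        // Compare case-insensitively, so "JAVAscRipt:" matches as well.
        return ScriptSchemes.Any(s =>
            compact.StartsWith(s + ":", StringComparison.OrdinalIgnoreCase));
    }
}
```

With this helper, UrlSchemeCheck.IsScriptUrl("JAVAscRipt:alert('hi')") is true, while an ordinary http URL is left alone. An allow-list of known-good schemes (http, https, mailto) would be stricter than this block-list, at the cost of rejecting anything unexpected.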


We had the same problem: Users enter HTML and we want to display it inside our XHTML pages. Note that they enter HTML fragments and not complete documents. I did research on this back in 2010 using unit tests to test for many different cases.

Solution:

  1. Use Microsoft Anti-Cross Site Scripting Library to remove everything considered unsafe (mainly scripts). Note that this tool doesn't close these tags: img, hr, br and sometimes it closes tags in the wrong order.
  2. Use Tidy.Net to create almost valid XHTML.
  3. Remove html, head and body tags that Tidy.Net tends to create.
  4. Remove extra line breaks that Tidy.Net creates inside "pre" tags.

This will remove all JS and, in most cases, produce a valid XHTML fragment. It will also remove all style tags.

The tools I tried have these problems:

Microsoft Anti-Cross Site Scripting Library: Doesn't close these tags: img, hr, br and sometimes it closes tags in the wrong order. Unfortunately not customizable.

Tidy.Net: Creates extra line breaks inside pre tags. (Can be fixed manually after running the tool.)

TidyForNet: Unstable. Sometimes gives you "Assertion failed in blabla.c".

Tidy (C-DLL) COM wrapper made in VB6: Impractical to say the least. You have to register the COM DLL.

HtmlAgilityPack: Inserts extra line breaks occasionally. Removes line breaks from pre tags.

Majestic12 HTML-parser: Doesn't close these tags: img, hr, br and sometimes it closes tags in the wrong order.

AntiSamy.Net: Impractical in that it uses components written in J# which is obsolete. Due to this it cannot run in a 64 bit environment. On the plus side it is very customizable regarding which tags and attribute values to allow.


How about Encoder.HtmlEncode? VS 2010 suggests it when you try to work with AntiXss.HtmlEncode.
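
When the input does not need to keep any markup at all, encoding is the simpler route. Here is a minimal sketch of the idea. Note the assumption: it uses System.Net.WebUtility.HtmlEncode from the .NET base class library as a stand-in for the AntiXSS Encoder.HtmlEncode named above, and the two may differ in exactly which characters they encode. The class name EncodeDemo is hypothetical.

```csharp
using System;
using System.Net;

public static class EncodeDemo
{
    // Encodes markup characters so a browser displays the text
    // instead of interpreting it as HTML or script.
    public static string Encode(string userInput)
    {
        // Stand-in for AntiXSS Encoder.HtmlEncode (assumption, see lead-in).
        return WebUtility.HtmlEncode(userInput ?? string.Empty);
    }
}
```

Calling EncodeDemo.Encode("<script>alert('hi')</script>") yields text with no raw angle brackets left, so the tag is shown literally rather than executed. This does not help if users are supposed to be able to enter real HTML, which is what the sanitizer approaches in this thread are for.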


    string value = "Here <script>alert('hello')</script> we go. Visit the " +
        "<a href='http://west-wind.com'>West Wind</a> site. " +
        "<img src='http://west-wind.com/images/new.gif' /> ";
    string safestring = Microsoft.Security.Application.Sanitizer.GetSafeHtmlFragment(value);

The above code will remove the script tags from the string.


I would use built-in methods. As I see it, if a user wants to break your program, they will find a way to do it. But if you combine multiple methods of sanitizing user input, your program will only be more secure.

For instance, with a string variable named "myString", I would combine regex character stripping with plain manual character stripping, just to be safe.

This will remove everything that isn't alphanumeric.

    // requires: using System.Text.RegularExpressions;
    myString = Regex.Replace(myString, "[^a-z0-9]", "", RegexOptions.IgnoreCase);
    myString = myString.Replace("/", "");
    myString = myString.Replace("<", "");

etc.

You could also extend this further by removing text that is between "<" and ">" characters and then between ">" and "<".
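
To see what the alphanumeric regex above actually does to a value, here is a small sketch (the class and method names, AlnumFilter and KeepAlphanumeric, are made up for illustration). It keeps the letters a-z, matched case-insensitively, and the digits 0-9, and drops everything else, including spaces and punctuation.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlnumFilter
{
    // Keeps letters a-z (case-insensitive) and digits 0-9; drops every other character.
    public static string KeepAlphanumeric(string input)
    {
        return Regex.Replace(input ?? string.Empty, "[^a-z0-9]", "", RegexOptions.IgnoreCase);
    }
}
```

For example, AlnumFilter.KeepAlphanumeric("Hello <b>World</b>!") returns "HellobWorldb": the markup is neutralized, but the tag names and the loss of spaces show that this only suits fields such as user names or IDs, not free text that should stay readable.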

I prefer not to use external third-party libraries unless I have to, because you have to distribute the library as well, you are relying on someone else's program to make yours secure, and if there is a vulnerability in their software, yours is vulnerable too.
