Can this be improved? Scrubbing of dangerous html tags

2023-01-02 21:12 问答作者：

I been finding that for something that I consider pretty import there is very little information or libraries on how to deal with this problem.

I found this while searching. I really don't know all the million ways that a hacker could try to insert the dangerous tags.

I have a rich html editor so I need to keep non dangerous tags but strip out bad ones.

So is this script missing anything?

It uses html agility pack.

public string ScrubHTML(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    //Remove potentially harmful elements
    HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object|//embed");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.ParentNode.RemoveChild(node, false);

        }
    }

    //remove hrefs to java/j/vbscript URLs
    nc = doc.DocumentNode.SelectNodes("//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
    if (nc != null)
    {

        foreach (HtmlNode node in nc)
        {
            node.SetAttributeValue("href", "#");
        }
    }


    //remove img with refs to java/j/vbscript URLs
    nc = doc.DocumentNode.SelectNodes("//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.SetAttributeValue("src", "#");
        }
    }

    //remove on<Event> handlers from all tags
    nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondoubleclick or @onload or @onunload]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.Attributes.Remove("onFocus");
            node.Attributes.Remove("onBlur");
            node.Attributes.Remove("onClick");
            node.Attributes.Remove("onMouseOver");
            node.Attributes.Remove("onMouseOut");
            node.Attributes.Remove("onDoubleClick");
            node.Attributes.Remove("onLoad");
            node.Attributes.Remove("onUnload");
        }
    }

    // remove any style attributes that contain the word expression (IE evaluates this as script)
    nc = doc.DocumentNode.SelectNodes("//*[contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'expression')]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.Attributes.Remove("stYle");
        }
    }

    return doc.DocumentNode.WriteTo();
}

Edit

2 people have suggested whitelisting. I actually like the idea of whitelisting but never actually did it because no one can actually tell me how to do it in C# and I can't even really find tutorials for how to do it in c#(the last time I looked. I will check it out again).

How do you make a white list? Is it just a list collection?
How do you actual parse out all html tags, script tags and every other tag?
Once you have the tags how do you determine which ones are allowed? Compare them to you list collection? But what happens if the content is coming in and has like 100 tags and you have 50 allowed. You got to compare each of those 100 tag by 50 allowed tags. Thats quite a bit to go through and could be slow.
Once you found a invalid tag how do you remove it? I don't really want to reject a whole set of text if one tag was found to be invalid. I rather remove and insert the rest.开发者_运维问答
Should I be using html agility pack?

That code is dangerous -- you should be whitelisting elements, not blacklisting them.

In other words, make a small list of tags and attributes you want to allow, and don't let any others through.

EDIT: I'm not familiar with HTML agility pack, but I see no reason why it wouldn't work for this. Since I don't know the framework, I'll give you pseudo-code for what you need to do.

doc.LoadHtml(html);

var validTags = new List<string>(new string[] {"b", "i", "u", "strong", "em"});

var nodes = doc.DocumentNode.SelectAllNodes();
foreach(HtmlNode node in nodes)
    if(!validTags.Contains(node.Tag.ToLower()))
        node.Parent.ReplaceNode(node, node.InnerHtml);

Basically, for each tag, if it's not contained in the whitelist, replace the tag with just its inner HTML. Again, I don't know your framework, so I can't really give you specifics, sorry. Hopefully this gets you started in the right direction.

Yes, I already see you're missing onmousedown, onmouseup, onchange, onsubmit, etc. This is part of why should use whitelisting for both tags and attributes. Even if you had a perfect blacklist now (very unlikely), tags and attributes are added fairly often.

See Why use a whitelist for HTML sanitizing?.

继续阅读：.net javascript security

Can this be improved? Scrubbing of dangerous html tags

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？