Clean user HTML in .net
My C# site allows users to submit HTML to be displayed on the site. I would like to limit the tags and attributes allowed for the HTML, but am unable to figure out how to do this in .net.
I've tried using Html Agility Pack, but I don't see how to modify the HTML. I can see how to walk through the HTML and find certain data, but actually generating an output file is baffling me.
Does anyone have a good example for cleaning up HTML in .net? The agility pack might be the answer, but the documentation is lacking.
I would strongly recommend Microsoft's Anti-XSS Library for sanitizing input. It supports sanitizing HTML.
You should only accept well-formed HTML.
You can then use LINQ to XML to parse and modify it.
You can make a recursive function that takes an element from the user and returns a new element with a whitelisted set of tags and attributes.
For example:
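using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Maps allowed tags to the attributes allowed on each tag.
static readonly Dictionary<string, string[]> AllowedTags = new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase) {
    { "b", new string[0] },
    { "img", new string[] { "src", "alt" } },
    //...
};

// Assumes dirtyElement's own tag is already in AllowedTags (otherwise the lookup below throws).
// Uses C# 7 pattern matching.
static XElement CleanElement(XElement dirtyElement) {
    string tag = dirtyElement.Name.LocalName;
    return new XElement(dirtyElement.Name,
        // Keep only the whitelisted attributes for this tag.
        dirtyElement.Attributes()
            .Where(a => AllowedTags[tag].Contains(a.Name.LocalName, StringComparer.OrdinalIgnoreCase)),
        // Keep text, recurse into whitelisted child elements, and drop everything else.
        dirtyElement.Nodes()
            .Where(n => n is XText || (n is XElement e && AllowedTags.ContainsKey(e.Name.LocalName)))
            .Select(n => n is XElement e ? CleanElement(e) : n)
    );
}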
If you allow hyperlinks, make sure to disallow javascript: URLs; this code doesn't do that.
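One way to handle that is to allow-list URL schemes rather than block javascript: specifically. The following is only a minimal sketch, assuming that absolute http and https links are the only ones you want to accept; UrlFilter and IsSafeUrl are hypothetical names, not part of any library:

```csharp
using System;

static class UrlFilter
{
    // Hypothetical helper: returns true only for absolute http/https URLs.
    // Anything else (javascript:, data:, file:, relative paths, garbage) is rejected.
    public static bool IsSafeUrl(string value)
    {
        if (string.IsNullOrWhiteSpace(value))
            return false;

        // Uri normalizes the scheme to lower case, so "JavaScript:" is caught too.
        return Uri.TryCreate(value.Trim(), UriKind.Absolute, out Uri uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

You would call this on every URL-valued attribute (href, src) while cleaning and drop the attribute when it returns false. If you need relative links, that is an extra case this sketch does not cover.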
With HtmlAgilityPack you can remove unwanted tags from the input:
node.ParentNode.RemoveChild(node);
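To address the "how do I get output" part of the question, here is a rough sketch of a whitelist pass with HtmlAgilityPack that returns the cleaned markup as a string. The class name HapSanitizer, the method Clean, and the contents of the whitelist are illustrative assumptions; the HtmlAgilityPack calls (LoadHtml, Descendants, Remove, OuterHtml) come from the NuGet package:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class HapSanitizer
{
    // Illustrative whitelist: tag name -> attributes allowed on that tag.
    static readonly Dictionary<string, string[]> Allowed =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            { "b", new string[0] },
            { "i", new string[0] },
            { "p", new string[0] },
            { "img", new[] { "src", "alt" } },
        };

    public static string Clean(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot with ToList() because the tree is modified while iterating.
        foreach (var node in doc.DocumentNode.Descendants().ToList())
        {
            if (node.NodeType == HtmlNodeType.Comment)
            {
                node.Remove();
                continue;
            }
            if (node.NodeType != HtmlNodeType.Element)
                continue; // text nodes are kept as-is

            if (!Allowed.TryGetValue(node.Name, out var allowedAttributes))
            {
                node.Remove(); // drops the element and everything inside it
                continue;
            }

            // Strip attributes that are not whitelisted for this tag.
            foreach (var attribute in node.Attributes.ToList())
            {
                if (!allowedAttributes.Contains(attribute.Name, StringComparer.OrdinalIgnoreCase))
                    attribute.Remove();
            }
        }

        // The modified tree serializes back to markup via OuterHtml.
        return doc.DocumentNode.OuterHtml;
    }
}
```

This sketch removes a disallowed element together with its contents; if you would rather keep the inner text, RemoveChild has an overload that keeps the grandchildren. It also does not inspect attribute values, so URL attributes such as src still need their own check.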
Another tool you can use, available on SourceForge, is SGMLReader. It turns the HTML into properly formatted XML and lets you read it as an XmlReader or load it into an XmlDocument object for further processing. I have used it before for parsing web pages, which are not always properly formatted HTML.
Have you had a look at MarkdownSharp which is Open Source and created by the guys here?
Jeff Atwood posted his whitelist-based approach on Refactor My Code at http://refactormycode.com/codes/333-sanitize-html
I believe StackOverflow combines that with the tag-balancing code at http://refactormycode.com/codes/360-balance-html-tags for sanitizing posts and preparing them for display. And, of course, they use MarkdownSharp for enabling Markdown on posts.