开发者

ASP.NET Trying to strip out html from a string

So I have a cms where users can enter content through cuteeditor, which works fine and then display this data on my website. One thing which happens rarely but annoying is that users enter certain mark up in their text which makes the font look different than the other fonts on the page i.e.

<span style="font-size: 11pt">Special Olympics Ireland provides year round sports training and athletic competition&nbsp;in a variety of Olympic&nbsp;type sports&nbsp;for persons with&nbsp;intellectual&nbsp;disabilities&nbsp;in </span><span style="font-size: 11pt">Ireland</span><span style="font-size: 11pt"> and </span><span style="font-size: 11pt">Northern Ireland</span><span style="font-size: 11pt"> in accordance with and furtherance of the mission, goal and founding principles of the international Special Olympics movement.</span> 

Basically what I want to do is

String.Replace("<span style="font-size: 11pt"开发者_如何学C>","")

But ofcourse that will only capture the above case the next time they could use font size 8,9 or 10 whatever so the filter method would have to be smart like that.

Any ideas ?

SO at the moment I have something like testSpan = Regex.Replace(testSpan, @"\s]+))?)+\s*|\s*)/?>", String.Empty);

But it gets rid of all html basically I just want to get rid of tags


You should really use a proper HTML parser for this sort of thing.


If you want to follow StackOverflow's example, you could create a white-list of allowed HTML tags, and strip out the rest.

Following are the code snippets Jeff Atwood uses to sanitize and balance HTML tags in StackOverflow's user-generated content.

  • Sanitize http://refactormycode.com/codes/333-sanitize-html
  • Balance http://refactormycode.com/codes/360-balance-html-tags
  • List of allowed tags https://meta.stackexchange.com/questions/1777/what-html-tags-are-allowed

Update

It looks like Refactormycode is dead. Here's some code that I captured before that happened:

/// <summary>
/// Provides some static extension methods for processing strings with HTML in them.
/// </summary>
public static class HtmlStripper
{
    #region Sanitize

    private static readonly Regex Tags = new Regex("<[^>]*(>|$)",
        RegexOptions.Singleline | RegexOptions.ExplicitCapture |
        RegexOptions.Compiled);

    private static readonly Regex Whitelist =
        new Regex(
            @"
^</?(b(lockquote)?|code|d(d|t|l|el)|em|h(1|2|3)|i|kbd|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)>$|
^<(b|h)r\s?/?>$",
            RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled |
            RegexOptions.IgnorePatternWhitespace);

    private static readonly Regex WhitelistA =
        new Regex(
            @"
^<a\s
href=""(\#\d+|(https?|ftp)://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+)""
(\stitle=""[^""<>]+"")?(\starget=""[^""<>]+"")?\s?>$|
^</a>$",
            RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled |
            RegexOptions.IgnorePatternWhitespace);

    private static readonly Regex WhitelistImg =
        new Regex(
            @"
^<img\s
src=""https?://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+""
(\swidth=""\d{1,3}"")?
(\sheight=""\d{1,3}"")?
(\salt=""[^""<>]*"")?
(\stitle=""[^""<>]*"")?
\s?/?>$",
            RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled |
            RegexOptions.IgnorePatternWhitespace);


    /// <summary>
    /// sanitize any potentially dangerous tags from the provided raw HTML input using 
    /// a whitelist based approach, leaving the "safe" HTML tags
    /// CODESNIPPET:4100A61A-1711-4366-B0B0-144D1179A937
    /// </summary>
    /// <remarks>
    /// Based on Jeff Atwood's code, found at http://refactormycode.com/codes/333-sanitize-html
    /// Since Jeff Atwood is StackOverflow's administrator, this is most likely the code used by
    /// that site. See http://meta.stackoverflow.com/questions/1777/what-html-tags-are-allowed
    /// for a list of allowed tags.
    /// </remarks>
    public static string SanitizeHtml(string html)
    {
        if (String.IsNullOrEmpty(html)) return html;

        // match every HTML tag in the input
        MatchCollection tags = Tags.Matches(html);
        for (int i = tags.Count - 1; i > -1; i--)
        {
            Match tag = tags[i];
            string tagname = tag.Value.ToLowerInvariant();

            if (!(Whitelist.IsMatch(tagname) || WhitelistA.IsMatch(tagname) || WhitelistImg.IsMatch(tagname)))
            {
                html = html.Remove(tag.Index, tag.Length);
            }
        }

        return html;
    }

    #endregion

    #region Balance tags

    private static readonly Regex Namedtags = new Regex
        (@"</?(?<tagname>\w+)[^>]*(\s|$|>)",
            RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    /// <summary>
    /// attempt to balance HTML tags in the html string
    /// by removing any unmatched opening or closing tags
    /// IMPORTANT: we *assume* HTML has *already* been 
    /// sanitized and is safe/sane before balancing!
    /// 
    /// CODESNIPPET: A8591DBA-D1D3-11DE-947C-BA5556D89593
    /// </summary>
    /// <remarks>
    /// From Jeff Atwood's post at 
    /// http://refactormycode.com/codes/360-balance-html-tags
    /// </remarks>
    public static string BalanceTags(string html)
    {
        if (String.IsNullOrEmpty(html)) return html;

        // convert everything to lower case; this makes
        // our case insensitive comparisons easier
        MatchCollection tags = Namedtags.Matches(html.ToLowerInvariant());

        // no HTML tags present? nothing to do; exit now
        int tagcount = tags.Count;
        if (tagcount == 0) return html;

        const string ignoredtags = "<p><img><br><li><hr>";
        var tagpaired = new bool[tagcount];
        var tagremove = new bool[tagcount];

        // loop through matched tags in forward order
        for (int ctag = 0; ctag < tagcount; ctag++)
        {
            string tagname = tags[ctag].Groups["tagname"].Value;

            // skip any already paired tags
            // and skip tags in our ignore list; assume they're self-closed
            if (tagpaired[ctag] || ignoredtags.Contains("<" + tagname + ">")) continue;

            string tag = tags[ctag].Value;
            int match = -1;

            if (tag.StartsWith("</"))
            {
                // this is a closing tag
                // search backwards (previous tags), look for opening tags
                for (int ptag = ctag - 1; ptag >= 0; ptag--)
                {
                    string prevtag = tags[ptag].Value;
                    if (!tagpaired[ptag] && prevtag.Equals("<" + tagname, StringComparison.InvariantCulture))
                    {
                        // minor optimization; we do a simple possibly incorrect match above
                        // the start tag must be <tag> or <tag{space} to match
                        if (prevtag.StartsWith("<" + tagname + ">") || prevtag.StartsWith("<" + tagname + " "))
                        {
                            match = ptag;
                            break;
                        }
                    }
                }
            }
            else
            {
                // this is an opening tag
                // search forwards (next tags), look for closing tags
                for (int ntag = ctag + 1; ntag < tagcount; ntag++)
                {
                    if (!tagpaired[ntag] &&
                        tags[ntag].Value.Equals("</" + tagname + ">", StringComparison.InvariantCulture))
                    {
                        match = ntag;
                        break;
                    }
                }
            }

            // we tried, regardless, if we got this far
            tagpaired[ctag] = true;
            if (match == -1) tagremove[ctag] = true; // mark for removal
            else tagpaired[match] = true; // mark paired
        }

        // loop through tags again, this time in reverse order
        // so we can safely delete all orphaned tags from the string
        for (int ctag = tagcount - 1; ctag >= 0; ctag--)
        {
            if (tagremove[ctag])
            {
                html = html.Remove(tags[ctag].Index, tags[ctag].Length);
            }
        }

        return html;
    }

    #endregion
}


Heres a function I use to strip HTML from a string in VB.NET:

Public Shared Function StripHTML(ByVal htmlString As String) As String
     Dim pattern As String = "<(.|\n)*?>"
     Return Regex.Replace(htmlString, pattern, String.Empty)
End Function

Hope it helps


For this specific case, you can do something like this

   String input = @"<span style=""font-size: 11pt"">Special Olympics Ireland provides year round sports training and athletic competition&nbsp;in a variety of Olympic&nbsp;type sports&nbsp;for persons with&nbsp;intellectual&nbsp;disabilities&nbsp;in </span><span style=""font-size: 11pt"">Ireland</span><span style=""font-size: 11pt""> and </span><span style=""font-size: 11pt"">Northern Ireland</span><span style=""font-size: 11pt""> in accordance with and furtherance of the mission, goal and founding principles of the international Special Olympics movement.</span>";
   var element = XElement.Parse(input.Replace("&nbsp;"," "));
   string stripped = element.Value;

but usually you don't want to deal with any kind of string manipulation or parsing on html directly. it's best to use a parser as pointed out in other answers.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜