Optimising HTML tag removal in C#

2023-04-01 16:31

I have some code that removes HTML tags from text. I don't care about the content (script, CSS, text, etc.); the important thing, at least for now, is that the tags themselves are stripped out.

This may be entering the theatre of micro-optimisation; however, this code is one of a small number of functions that will run very often against large amounts of data, so any percentage saving may carry through to a useful saving for the application as a whole.

The code at present looks like this:

public static string StripTags(string html)
{
    var currentIndex = 0;
    var insideTag = false;
    var output = new char[html.Length];

    for (int i = 0; i < html.Length; i++)
    {
        var c = html[i];
        if (c == '>')
        {
            insideTag = false;
            continue;
        }
        if (!insideTag)
        {
            if (c == '<')
            {
                insideTag = true;
                continue;
            }
            output[currentIndex] = c;
            currentIndex++;
        }
    }
    return new string(output, 0, currentIndex);
}

Are there any obvious .NET tricks I'm missing here? For info, this is using .NET 4.

Many thanks.

In this code you copy chars one by one. You might be able to speed it up considerably by finding only where the current section (inside or outside a tag) ends, and then using Array.Copy (or string.CopyTo, since the source here is a string) to move that whole chunk in one go. This would enable lower-level optimizations: for instance, on a 64-bit machine four UTF-16 chars (4 * 16 bits) fit in a single 64-bit word, so they could be moved together. The runs of text between the tags are probably quite long, so this could add up.

Also, the StringBuilder documentation mentions somewhere that, because it is implemented in the framework and not in C#, it has performance that you can't replicate in managed C#. I'm not sure how you would append a chunk with it, but you might look into that.

Regards Gert-Jan
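
The chunk-copying idea above could look roughly like the following. This is only a sketch, not a measured improvement: the names TagStripper, StripTagsChunked and Delimiters are made up for illustration, and it assumes the same rules as the loop in the question (a stray '>' outside a tag is dropped, and an unterminated '<' discards the rest of the input). Whether it is actually faster depends on how long the runs between tags are, so it would need benchmarking on real data.

```csharp
using System;

public static class TagStripper
{
    // Characters that end a run of plain text (hypothetical helper field).
    private static readonly char[] Delimiters = { '<', '>' };

    // Hypothetical chunk-copying variant of StripTags from the question.
    public static string StripTagsChunked(string html)
    {
        var output = new char[html.Length];
        var written = 0;
        var pos = 0;
        var insideTag = false;

        while (pos < html.Length)
        {
            if (insideTag)
            {
                // Skip everything up to and including the closing '>'.
                var close = html.IndexOf('>', pos);
                if (close < 0) break; // unterminated tag: drop the rest
                pos = close + 1;
                insideTag = false;
            }
            else
            {
                // Find the end of the current text run and copy it in one call.
                var next = html.IndexOfAny(Delimiters, pos);
                var end = next < 0 ? html.Length : next;
                html.CopyTo(pos, output, written, end - pos);
                written += end - pos;
                if (next < 0) break;

                // '<' opens a tag; a stray '>' is simply dropped.
                insideTag = html[next] == '<';
                pos = next + 1;
            }
        }
        return new string(output, 0, written);
    }
}
```

On the StringBuilder question: StringBuilder.Append(string value, int startIndex, int count) appends a substring without creating an intermediate string, so it could be used in place of the CopyTo call here.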

You should take a look at the following library, as it seems to be the best way to work with HTML in .NET: http://htmlagilitypack.codeplex.com/

Do not solve a non-existent problem.

How many times will this method be called? Many! How many? Several thousand? Not enough to warrant optimization.

Can you just do a Parallel.For and speed it up 3-5 times, depending on the machine? Possibly.

Is your code dependent on lots of other code? Certainly.

Is it possible that you have this:

// Some slow code
StripTags(s); // Super fast version
// Some more slow code here

Will it matter then how fast your StripTags is?

Are you getting the strings from a file? Are you getting them from a network? The bottleneck is very rarely your raw CPU power.

Let me repeat myself:

Do not solve a non-existent problem!
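
For reference, the Parallel.For suggestion above might be sketched like this, assuming the inputs are already available as an array of independent strings. BatchStripper and StripAll are hypothetical names; the strip function is passed in so that any of the implementations on this page could be used. Any speed-up depends on core count and input sizes and is not guaranteed.

```csharp
using System;
using System.Threading.Tasks;

public static class BatchStripper
{
    // Hypothetical helper: applies a strip function to every document in parallel.
    public static string[] StripAll(string[] documents, Func<string, string> strip)
    {
        var results = new string[documents.Length];

        // Each iteration writes only to its own slot, so no locking is needed.
        Parallel.For(0, documents.Length, i =>
        {
            results[i] = strip(documents[i]);
        });

        return results;
    }
}
```

Usage would be along the lines of BatchStripper.StripAll(pages, StripTags), where pages is an assumed string[] of inputs.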

You can also encode it instead, which escapes the markup rather than removing it:

string encodedString = Server.HtmlEncode(stringToEncode);

Have a look here: http://msdn.microsoft.com/en-us/library/ms525347%28v=vs.90%29.aspx
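
To make the difference from stripping concrete, here is a small sketch. It uses System.Net.WebUtility.HtmlEncode rather than Server.HtmlEncode so that it runs outside a web request; that swap, and the names EncodeDemo and Escape, are assumptions for illustration only.

```csharp
using System.Net;

public static class EncodeDemo
{
    // Escapes markup characters so the tags show up as literal text.
    public static string Escape(string text)
    {
        return WebUtility.HtmlEncode(text);
    }
}
```

Here EncodeDemo.Escape("<b>hi</b>") returns "&lt;b&gt;hi&lt;/b&gt;": the tags are still present, just escaped, so this only helps if the goal is to display the text safely rather than to remove the tags.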

Googling for "remove html from string" yields many links that suggest using regular expressions, all similar to the following:

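// Requires: using System.Text.RegularExpressions;
public string Strip(string text)
{
    // Lazily match everything from '<' up to the nearest '>' and remove it.
    return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
}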
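
If the regex route is taken for a hot path, one possible variation is to build the Regex once and reuse it. The sketch below does that. Note two substitutions that are not in the answer above: the pattern is rewritten as `<[^>]*>`, which should match the same spans without the alternation, and RegexOptions.Compiled is added. RegexStripper and TagPattern are hypothetical names. Unlike the loop in the question, this leaves a stray '>' and an unterminated '<' in place.

```csharp
using System.Text.RegularExpressions;

public static class RegexStripper
{
    // Built once and reused; Compiled trades start-up cost for faster matching.
    private static readonly Regex TagPattern =
        new Regex("<[^>]*>", RegexOptions.Compiled);

    public static string Strip(string text)
    {
        return TagPattern.Replace(text, string.Empty);
    }
}
```

Whether this beats the hand-written loop is something only a benchmark on representative data can answer.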
