Optimising HTML tag removal in C#

I have some code that removes HTML tags from text. I don't care about the content (script, CSS, text, etc.); the important thing, at least for now, is that the tags themselves are stripped out.

This may be entering the theatre of micro-optimisation. However, this code is one of a small number of functions that will run very often against large amounts of data, so any percentage saving may carry through to a useful saving for the application as a whole.

The code at present looks like this:

public static string StripTags(string html)
{
    var currentIndex = 0;
    var insideTag = false;
    var output = new char[html.Length];

    for (int i = 0; i < html.Length; i++)
    {
        var c = html[i];
        if (c == '>')
        {
            insideTag = false;
            continue;
        }
        if (!insideTag)
        {
            if (c == '<')
            {
                insideTag = true;
                continue;
            }
            output[currentIndex] = c;
            currentIndex++;
        }
    }
    return new string(output, 0, currentIndex);
}

Are there any obvious .NET tricks I'm missing here? For info, this is using .NET 4.

Many thanks.


This code copies chars one at a time. You might be able to speed it up considerably by finding only where the current section (inside or outside a tag) ends and then copying that whole chunk in one go, for example with string.CopyTo (or Array.Copy if the input is already a char[]). A block copy allows lower-level optimisations: on a 64-bit machine, for instance, four UTF-16 chars (4 * 16 bits) fit in a single 64-bit move. The runs of text between tags are probably quite long, so this could add up.

Also, the StringBuilder documentation mentions somewhere that, because it is implemented in the framework rather than in C#, it has performance you can't replicate in managed C#. I'm not sure how you would append a chunk with it, but you might look into that.

Regards Gert-Jan
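
The chunk-copy idea above might look roughly like the following. This is only a sketch, not a measured improvement: the class name ChunkedTagStripper is made up for illustration, and it assumes string.IndexOf and string.CopyTo are acceptable ways to find and copy each run. One behavioural difference from the original loop: a stray '>' outside any tag is kept here, whereas the original drops it.

```csharp
using System;

public static class ChunkedTagStripper
{
    // Hypothetical variant of StripTags that copies each run of text
    // between tags with a single CopyTo call instead of char by char.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }

        var output = new char[html.Length];
        int written = 0;
        int pos = 0;

        while (pos < html.Length)
        {
            // Find where the current run of plain text ends.
            int open = html.IndexOf('<', pos);
            int end = open < 0 ? html.Length : open;
            int count = end - pos;

            if (count > 0)
            {
                // Block-copy the whole run into the output buffer.
                html.CopyTo(pos, output, written, count);
                written += count;
            }

            if (open < 0)
            {
                break; // no more tags
            }

            // Skip to the end of the tag.
            int close = html.IndexOf('>', open + 1);
            if (close < 0)
            {
                break; // unterminated tag: drop the rest, as the original loop does
            }

            pos = close + 1;
        }

        return new string(output, 0, written);
    }
}
```

Whether this is actually faster depends on how long the text runs are; for markup made mostly of short tags the extra IndexOf calls may cost more than they save, so it is worth timing against real input.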


You should take a look at the following library as it seems to be the best way to interact with html files in .NET: http://htmlagilitypack.codeplex.com/


Do not solve a non-existent problem.

How many times will this method be called? Many! How many? Several thousands? Not enough to warrant optimization.

Can you just do a Parallel.For and speed it up 3-5 times depending on machine? Possibly.

Is your code dependent on lots of other code? Certainly.

Is it possible that you have this:

// Some slow code
StripTags(s); // Super fast version
// Some more slow code here

Will it matter then how fast your StripTags is?

Are you getting the strings from a file? From a network? The bottleneck is very rarely raw CPU power.

Let me repeat myself:

Do not solve a non-existent problem!
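
One way to check whether there is a problem at all is to time the method on representative input before changing it. A minimal sketch using System.Diagnostics.Stopwatch follows; the StripTagsTiming class and its Measure method are hypothetical names, and the delegate parameter stands in for whichever StripTags implementation is being tested.

```csharp
using System;
using System.Diagnostics;

public static class StripTagsTiming
{
    // Runs `strip` on `sample` `iterations` times and returns the
    // elapsed wall-clock time in milliseconds.
    public static long Measure(Func<string, string> strip, string sample, int iterations)
    {
        strip(sample); // warm-up call so JIT compilation is not counted

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            strip(sample);
        }
        stopwatch.Stop();

        return stopwatch.ElapsedMilliseconds;
    }
}
```

Comparing this figure with the time spent reading the data from disk or the network shows whether StripTags is worth tuning.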


You can also encode it:

string encodedString = Server.HtmlEncode(stringToEncode);

Have a look here: http://msdn.microsoft.com/en-us/library/ms525347%28v=vs.90%29.aspx
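
Note that encoding keeps the tags in the output as visible text rather than removing them. Outside an ASP.NET page, where no Server object is available, System.Net.WebUtility.HtmlEncode (present in .NET 4) does a similar job. A small sketch, with EncodeExample as a made-up wrapper name:

```csharp
using System.Net;

public static class EncodeExample
{
    // Escapes HTML-significant characters instead of stripping tags.
    public static string Encode(string text)
    {
        return WebUtility.HtmlEncode(text);
    }
}
```

For example, Encode("<b>hi</b>") returns "&lt;b&gt;hi&lt;/b&gt;".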


Googling for "remove html from string" yields many links that talk about using regular expressions, all similar to the following:

public string Strip(string text)
{
    return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
}
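
Since the question is about a method that runs very often, a variant that builds the Regex once and reuses it may be worth trying. The sketch below is an assumption-laden alternative, not the pattern from the links: RegexTagStripper is a made-up name, and the pattern <[^>]*> is swapped in for <(.|\n)*?> because a negated character class avoids the lazy alternation. No claim is made that it beats the hand-written loop.

```csharp
using System.Text.RegularExpressions;

public static class RegexTagStripper
{
    // Built once and reused; RegexOptions.Compiled trades start-up
    // cost for faster matching on repeated calls.
    private static readonly Regex TagPattern =
        new Regex("<[^>]*>", RegexOptions.Compiled);

    // Removes everything from a '<' up to and including the next '>'.
    public static string Strip(string text)
    {
        return TagPattern.Replace(text, string.Empty);
    }
}
```

Like the hand-written loop, this treats any '<' ... '>' pair as a tag, so it makes no attempt to handle '>' inside attribute values or comments.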