Fastest way to extract a substring in C#

2023-03-10 03:08 问答作者：

I'm going to process thousands of strings (with the size of ~150kB on average). Each of them contains zero or more substrings of the following form:

<a href="/link/i/want">Fixed_String&l开发者_运维技巧t;/a>

I would like to extract all such links and put them into a list.

Additionally, there is another fixed string after which the strings I am looking for will not appear.

What's the fastest way to get the strings?

The SubString() Option

As noted by Teoman Soygul, there is a SubString() Option, which I don't know if it is slower or faster as I did not test them side by side.

Now, this is not properly disected into sub methods but should give you the general idea.
I just use a ReadOnlyCollection cause it is what I'm used to when no further manipulation of the list is required. Change it to what ever output list type you like.

The someText variable is most likely going to end up a paramter of GetLinks off course.

public ReadOnlyCollection<string> GetLinks()
{
    string startingText = "href=''";
    string endText = "''>";
    string stopText = "Fixed_String";
    string someText = @"what is this text <a href=''/link/i/want''>somenormallink</a> some random text <a href=''/another link/i/want''>Fixed_String</a> some more radnom txt ";

    List<string> myLinks = new List<string>();

    string[] rawLinks = someText.Split(new string[] { "<a " }, StringSplitOptions.None);

    foreach (string rawLink in rawLinks)
    {
        if (!rawLink.StartsWith(startingText))
        {
            continue;
        }

        myLinks.Add(rawLink.Substring(startingText.Length, rawLink.IndexOf(endText, 1) - startingText.Length));


        if (rawLink.Contains(stopText))
        {
            break;
        }
    }


    return new ReadOnlyCollection<string>(myLinks);
}

That Results in a Collection containing the links:

Fastest way to extract a substring in C#

Assuming that strings are of properly formatted HTML format, you can easily parse them with XmlReader class which is non-cached and forward only (which makes it very very fast). You just seek for the proper node at retrieve its 'href' attribute's value.

You can also use normal string manipulation like .SubString() but then you'll have to write many subroutines to handle exceptional cases. The point here is to avoid RegEx since it will be the slowest of the bunch.

A bit of manual parsing is probably the fastest way to solve this. Regex would be possible, too, because it's really just a very simple case of parsing a link and not a whole HTML document, but that could easily choke on those large files, performance wise.

Now, let me preface this that I haven't tested this at all and I feel kinda dirty posting it (I'm sure it needs some more edge-case checks to avoid errors), but here you go:

    const char[] quotes = new char[] { '"', '\'' };

    private List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        string searchFor = ">Fixed_String</a>";

        for (int i = html.IndexOf(searchFor); i >= 0; i = html.IndexOf(searchFor, i + searchFor.Length))
        {
            string href = ExtractHref(html, i);
            if (!String.IsNullOrEmpty(href))
                links.Add(href);
        }

        return links;
    }

    private string ExtractHref(string html, int backtrackFrom)
    {
        int hrefStart = -1;

        // Find "<a", but limit search so we don't backtrack forever
        for (int i = backtrackFrom; i > backtrackFrom - 255; i--)
        {
            if (i < 0)
                return null;

            if (html[i] == '<' && html[i + 1] == 'a')
            {
                hrefStart = html.IndexOf("href=", i);
                break;
            }
        }

        if (hrefStart < 0)
            return null;

        int start = html.IndexOfAny(quotes, hrefStart);
        if (start < 0)
            return null;

        int end = html.IndexOfAny(quotes, start + 1);
        if (end < 0)
            return null;

        return html.Substring(start + 1, end - start - 1);
    }

XmlReader is probably a no-go, because you most likely cannot guarantee that those files are XHTML formatted. If you want to do proper parsing, the HTML Agility Pack is probably your best choice, or maybe a properly done Regex if it can't be helped. I posted this manual parsing so you have another alternative you can do a performance test with.

Generally Regex is faster with small files. If file size gets bigger (>~60Kb in my experience) then Regex becomes slower (Even static, compiled etc.). Found exact situation described in very good English:

Stripping Out Empty XmlElements in a Performant Way and the Bus Factor

Have fun discovering what is "High Bus Factor". It brought me a good mood for a day.

I think in this case where there are strings which are large enough on average and which contains zero or more substrings the best way will be to use Regex class like this:

string anchorPattern = @"<(.|/)a(.|\n)+?>";

foreach (string str in strings)
{
    Regex regex = new Regex(anchorPattern);

    foreach (Match match in regex.Matches(str))
    {
         // do here what you want with substring in match.Value
    }

}

From benchmarking, the optimal way of generating a substring is using ReadOnlySpans, not using string.Split

string.Split is much slower and writes a lot to memory while Readonly spans only write to heap.

|                  Method |      Mean |    Error |    StdDev |    Median |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------------ |----------:|---------:|----------:|----------:|-------:|------:|------:|----------:|
|    SpanParseLongVersion |  17.84 ns | 0.385 ns |  0.674 ns |         - |      - |     - |     - |         - |
| ParseLongFWVersionSplit |  95.74 ns | 1.928 ns |  3.274 ns |  95.05 ns | 0.0373 |     - |     - |     176 B |

public const string FWLongVersion= "FWabcdefghijklmnopqrstuvwxyz-1.0000000000001";

[Benchmark]
public void SpanParseLongVersion()
{
    var dashChar = '-';
    var vSpan = FWLongVersion.AsSpan();
    var length = FWLongVersion.Length;

    ReadOnlySpan<char> fwVersion = null;

    for (int x = 0; x < length; x++)
    {
        if (vSpan[x] == dashChar)
        {
            fwVersion = vSpan.Slice(x + 1, 5);
            break;
        }
    }
}

[Benchmark]
public void ParseLongFWVersionSplit()
{
    var x = FWLongVersion.Split('-');
}

also, using string.StartsWith and string.Contains is not great....for how to efficiently check this, please check my post here: https://stackoverflow.com/a/64395744/4926590

继续阅读：regex string

Fastest way to extract a substring in C#

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？