开发者

Fastest way to extract a substring in C#

I'm going to process thousands of strings (with the size of ~150kB on average). Each of them contains zero or more substrings of the following form:

<a href="/link/i/want">Fixed_String&l开发者_运维技巧t;/a>

I would like to extract all such links and put them into a list.

Additionally, there is another fixed string after which the strings I am looking for will not appear.

What's the fastest way to get the strings?


The SubString() Option

As noted by Teoman Soygul, there is a SubString() Option, which I don't know if it is slower or faster as I did not test them side by side.

Now, this is not properly disected into sub methods but should give you the general idea.
I just use a ReadOnlyCollection cause it is what I'm used to when no further manipulation of the list is required. Change it to what ever output list type you like.

The someText variable is most likely going to end up a paramter of GetLinks off course.

public ReadOnlyCollection<string> GetLinks()
{
    string startingText = "href=''";
    string endText = "''>";
    string stopText = "Fixed_String";
    string someText = @"what is this text <a href=''/link/i/want''>somenormallink</a> some random text <a href=''/another link/i/want''>Fixed_String</a> some more radnom txt ";

    List<string> myLinks = new List<string>();

    string[] rawLinks = someText.Split(new string[] { "<a " }, StringSplitOptions.None);

    foreach (string rawLink in rawLinks)
    {
        if (!rawLink.StartsWith(startingText))
        {
            continue;
        }

        myLinks.Add(rawLink.Substring(startingText.Length, rawLink.IndexOf(endText, 1) - startingText.Length));


        if (rawLink.Contains(stopText))
        {
            break;
        }
    }


    return new ReadOnlyCollection<string>(myLinks);
}

That Results in a Collection containing the links:

Fastest way to extract a substring in C#


Assuming that strings are of properly formatted HTML format, you can easily parse them with XmlReader class which is non-cached and forward only (which makes it very very fast). You just seek for the proper node at retrieve its 'href' attribute's value.

You can also use normal string manipulation like .SubString() but then you'll have to write many subroutines to handle exceptional cases. The point here is to avoid RegEx since it will be the slowest of the bunch.


A bit of manual parsing is probably the fastest way to solve this. Regex would be possible, too, because it's really just a very simple case of parsing a link and not a whole HTML document, but that could easily choke on those large files, performance wise.

Now, let me preface this that I haven't tested this at all and I feel kinda dirty posting it (I'm sure it needs some more edge-case checks to avoid errors), but here you go:

    const char[] quotes = new char[] { '"', '\'' };

    private List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        string searchFor = ">Fixed_String</a>";

        for (int i = html.IndexOf(searchFor); i >= 0; i = html.IndexOf(searchFor, i + searchFor.Length))
        {
            string href = ExtractHref(html, i);
            if (!String.IsNullOrEmpty(href))
                links.Add(href);
        }

        return links;
    }

    private string ExtractHref(string html, int backtrackFrom)
    {
        int hrefStart = -1;

        // Find "<a", but limit search so we don't backtrack forever
        for (int i = backtrackFrom; i > backtrackFrom - 255; i--)
        {
            if (i < 0)
                return null;

            if (html[i] == '<' && html[i + 1] == 'a')
            {
                hrefStart = html.IndexOf("href=", i);
                break;
            }
        }

        if (hrefStart < 0)
            return null;

        int start = html.IndexOfAny(quotes, hrefStart);
        if (start < 0)
            return null;

        int end = html.IndexOfAny(quotes, start + 1);
        if (end < 0)
            return null;

        return html.Substring(start + 1, end - start - 1);
    }

XmlReader is probably a no-go, because you most likely cannot guarantee that those files are XHTML formatted. If you want to do proper parsing, the HTML Agility Pack is probably your best choice, or maybe a properly done Regex if it can't be helped. I posted this manual parsing so you have another alternative you can do a performance test with.


Generally Regex is faster with small files. If file size gets bigger (>~60Kb in my experience) then Regex becomes slower (Even static, compiled etc.). Found exact situation described in very good English:

Stripping Out Empty XmlElements in a Performant Way and the Bus Factor

Have fun discovering what is "High Bus Factor". It brought me a good mood for a day.


I think in this case where there are strings which are large enough on average and which contains zero or more substrings the best way will be to use Regex class like this:

string anchorPattern = @"<(.|/)a(.|\n)+?>";

foreach (string str in strings)
{
    Regex regex = new Regex(anchorPattern);

    foreach (Match match in regex.Matches(str))
    {
         // do here what you want with substring in match.Value
    }

}


From benchmarking, the optimal way of generating a substring is using ReadOnlySpans, not using string.Split

string.Split is much slower and writes a lot to memory while Readonly spans only write to heap.

|                  Method |      Mean |    Error |    StdDev |    Median |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------------ |----------:|---------:|----------:|----------:|-------:|------:|------:|----------:|
|    SpanParseLongVersion |  17.84 ns | 0.385 ns |  0.674 ns |         - |      - |     - |     - |         - |
| ParseLongFWVersionSplit |  95.74 ns | 1.928 ns |  3.274 ns |  95.05 ns | 0.0373 |     - |     - |     176 B |
public const string FWLongVersion= "FWabcdefghijklmnopqrstuvwxyz-1.0000000000001";

[Benchmark]
public void SpanParseLongVersion()
{
    var dashChar = '-';
    var vSpan = FWLongVersion.AsSpan();
    var length = FWLongVersion.Length;

    ReadOnlySpan<char> fwVersion = null;

    for (int x = 0; x < length; x++)
    {
        if (vSpan[x] == dashChar)
        {
            fwVersion = vSpan.Slice(x + 1, 5);
            break;
        }
    }
}

[Benchmark]
public void ParseLongFWVersionSplit()
{
    var x = FWLongVersion.Split('-');
}

also, using string.StartsWith and string.Contains is not great....for how to efficiently check this, please check my post here: https://stackoverflow.com/a/64395744/4926590

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜