Minimize LINQ string token counter

2023-01-22 03:32 问答作者：

Followup on answer to an earlier question.

Is there a way to further reduce this, avoiding the external String.Split call? The goal is an associative container of {token, count}.

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";

string[] target = src.Split(new char[] { ' ' });

var results = target.开发者_运维问答GroupBy(t => new
{
    str = t,
    count = target.Count(sub => sub.Equals(t))
});

As you have it right now, it will work (to some extent) but is terribly inefficient. As is, the result is an enumeration of groupings, not the (word, count) pairs you might be thinking.

That overload of GroupBy() takes a function to select the key. You are effectively performing that calculation for every item in the collection. Without going the route of using regular expressions ignoring punctuation, it should be written like so:

string src = "for each character in the string, take the rest of the " +
             "string starting from that character " +
             "as a substring; count it if it starts with the target string";

var results = src.Split()               // default split by whitespace
                 .GroupBy(str => str)   // group words by the value
                 .Select(g => new
                              {
                                  str = g.Key,      // the value
                                  count = g.Count() // the count of that value
                              });

// sort the results by the words that were counted
var sortedResults = results.OrderByDescending(p => p.str);

While 3-4 times slower, the Regex method is arguably more accurate:

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";

var regex=new Regex(@"\w+",RegexOptions.Compiled);
var sw=new Stopwatch();

for (int i = 0; i < 100000; i++)
{
    var dic=regex
        .Matches(src)
        .Cast<Match>()
        .Select(m=>m.Value)
        .GroupBy(s=>s)
        .ToDictionary(g=>g.Key,g=>g.Count());
    if(i==1000)sw.Start();
}
Console.WriteLine(sw.Elapsed);

sw.Reset();

for (int i = 0; i < 100000; i++)
{
    var dic=src
        .Split(' ')
        .GroupBy(s=>s)
        .ToDictionary(g=>g.Key,g=>g.Count());
    if(i==1000)sw.Start();
}
Console.WriteLine(sw.Elapsed);

For instance, the Regex method won't count string and string, as two separate entries, and will correctly tokenise substring instead of substring;.

EDIT

Read your previous question and realise my code doesn't quite conform to your spec. Regardless, it still demonstrates the advantage/cost of using Regex.

Here's a LINQ version without ToDictionary(), which may add unnecessary overhead depending on your needs...

var dic = src.Split(' ').GroupBy(s => s, (str, g) => new { str, count = g.Count() });

Or in query syntax...

var dic = from str in src.Split(' ')
          group str by str into g
          select new { str, count = g.Count() };

Getting rid of String.Split doesn't leave many options on the table. One option is Regex.Matches as spender demonstrated, and another is Regex.Split (which doesn't give us anything new).

Rather than grouping you could use either of these approaches:

var target = src.Split(new[] { ' ', ',', ';' }, StringSplitOptions.RemoveEmptyEntries);
var result = target.Distinct()
                   .Select(s => new { Word = s, Count = target.Count(w => w == s) });

// or dictionary approach
var result = target.Distinct()
                   .ToDictionary(s => s, s => target.Count(w => w == s));

The Distinct call is needed to avoid duplicate items. I went ahead and expanded the characters to split on to get the actual words devoid of punctuation. I found the first approach to be the quickest using spender's benchmarking code.

Back to the requirement to order the results from your previously referenced question, you could easily extend the first approach as follows:

var result = target.Distinct()
                   .Select(s => new { Word = s, Count = target.Count(w => w == s) })
                   .OrderByDescending(o => o.Count);

// or in query form

var result = from s in target.Distinct()
             let count = target.Count(w => w == s)
             orderby count descending
             select new { Word = s, Count = count };

EDIT: got rid of the Tuple since the anonymous type was close at hand.

继续阅读：.net linq string

Minimize LINQ string token counter

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？