LINQ method for adding items to a dictionary
I'm trying to learn a bit more about LINQ by implementing Peter Norvig's spelling corrector in C#.
The first part involves taking a large file of words (about 1 million) and putting it into a dictionary where the key is the word and the value is the number of occurrences.
I'd normally do this like so:
foreach (var word in allWords)
{
    if (wordCount.ContainsKey(word))
        wordCount[word]++;
    else
        wordCount.Add(word, 1);
}
where allWords is an IEnumerable&lt;string&gt;.
In LINQ I'm currently doing it like this:
var wordCountLINQ = (from word in allWordsLINQ
                     group word by word into groups
                     select groups).ToDictionary(g => g.Key, g => g.Count());
I compare the two dictionaries by checking every &lt;key, value&gt; pair, and they're identical, so both approaches produce the same results.
The foreach loop takes 3.82 secs and the LINQ query takes 4.49 secs. I'm timing it using the Stopwatch class and running in RELEASE mode. I don't think the performance is bad; I was just wondering if there is a reason for the difference.
Am I doing the LINQ query in an inefficient way, or am I missing something?
Update: here's the full benchmark code sample:
public static void TestCode()
{
    // File can be downloaded from http://norvig.com/big.txt and consists of about a million words.
    const string fileName = @"path_to_file";

    var allWords = from Match m in Regex.Matches(File.ReadAllText(fileName).ToLower(),
                                                 "[a-z]+", RegexOptions.Compiled)
                   select m.Value;

    var wordCount = new Dictionary<string, int>();
    var timer = new Stopwatch();
    timer.Start();
    foreach (var word in allWords)
    {
        if (wordCount.ContainsKey(word))
            wordCount[word]++;
        else
            wordCount.Add(word, 1);
    }
    timer.Stop();
    Console.WriteLine("foreach loop took {0:0.00} ms ({1:0.00} secs)\n",
                      timer.ElapsedMilliseconds, timer.ElapsedMilliseconds / 1000.0);

    // Make LINQ use a different Enumerable (with exactly the same values);
    // if you don't, it suddenly becomes way faster, which I assume is a caching thing??
    var allWordsLINQ = from Match m in Regex.Matches(File.ReadAllText(fileName).ToLower(),
                                                     "[a-z]+", RegexOptions.Compiled)
                       select m.Value;

    timer.Reset();
    timer.Start();
    var wordCountLINQ = (from word in allWordsLINQ
                         group word by word into groups
                         select groups).ToDictionary(g => g.Key, g => g.Count());
    timer.Stop();
    Console.WriteLine("LINQ took {0:0.00} ms ({1:0.00} secs)\n",
                      timer.ElapsedMilliseconds, timer.ElapsedMilliseconds / 1000.0);
}
One of the reasons the LINQ version is slower is that instead of one dictionary, two are created: the group by operator builds one internally, and that internal one also stores each individual word, not just its count. You can verify this by calling ToArray() rather than Count() on a group. That's a lot of overhead you don't actually need in your case.
The ToDictionary method is basically a foreach over the actual LINQ query, where the results from the query are added to a new dictionary. Depending on the number of unique words, this can also take some time.
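To make that concrete, here is a rough sketch of the equivalence: BuildCounts is a hypothetical helper name, not a framework method, but its body is essentially what the GroupBy-plus-ToDictionary pipeline boils down to.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Sketch
{
    // Roughly what the LINQ version does: GroupBy materializes the groups
    // (the internal "first dictionary", holding every word), and then a
    // plain foreach copies key/count pairs into a second, new dictionary.
    public static Dictionary<string, int> BuildCounts(IEnumerable<string> words)
    {
        var groups = words.GroupBy(w => w);      // internal lookup built on enumeration
        var result = new Dictionary<string, int>();
        foreach (var g in groups)                // this loop is what ToDictionary runs
            result.Add(g.Key, g.Count());
        return result;
    }

    static void Main()
    {
        var counts = BuildCounts(new[] { "a", "b", "a" });
        Console.WriteLine(counts["a"]); // prints 2
    }
}
```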
Another reason that the LINQ query is a little slower, is because LINQ relies on lambda expressions (the delegate in Dathan's answer), and calling a delegate adds a tiny amount of overhead compared to inline code.
Edit: Note that for some LINQ scenarios (such as LINQ to SQL, but not in-memory LINQ such as here), rewriting the query produces a more optimized plan:
from word in allWordsLINQ
group word by word into groups
select new { Word = groups.Key, Count = groups.Count() }
Note however, that this doesn't give you a Dictionary, but rather a sequence of words and their counts. You can transform this into a Dictionary with
(from word in allWordsLINQ
group word by word into groups
select new { Word = groups.Key, Count = groups.Count() })
.ToDictionary(g => g.Word, g => g.Count);
When I build your second example and then open it in Reflector's disassembly view, I get the following:
Dictionary<string, int> wordCountLINQ = allWordsLINQ.GroupBy<string, string>(delegate (string word) {
return word;
}).Select<IGrouping<string, string>, IGrouping<string, string>>(delegate (IGrouping<string, string> groups) {
return groups;
}).ToDictionary<IGrouping<string, string>, string, int>(delegate (IGrouping<string, string> g) {
return g.Key;
}, delegate (IGrouping<string, string> g) {
return g.Count<string>();
});
Probably it takes longer just because there are more function calls happening, and over the course of a million iterations that adds up.
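You can observe the per-element delegate calls directly with a side-effecting lambda (used here purely for illustration; don't do this in real code):

```csharp
using System;
using System.Linq;

public class DelegateCalls
{
    // Returns how many times the GroupBy key selector delegate was invoked.
    public static int CountKeySelectorCalls(string[] words)
    {
        int calls = 0;
        var counts = words
            .GroupBy(w => { calls++; return w; }) // one delegate call per input word
            .ToDictionary(g => g.Key, g => g.Count());
        return calls;
    }

    static void Main()
    {
        // Three input words => the key selector delegate runs three times,
        // on top of the delegates for Select and the two ToDictionary selectors.
        Console.WriteLine(CountKeySelectorCalls(new[] { "the", "cat", "the" })); // prints 3
    }
}
```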
By completely abusing LINQ I was able to get it to be around the same and often slightly faster than the foreach loop, even with a delegate call:
var wordCountLINQ = allWordsLINQ.Aggregate(
    new Dictionary<string, int>(),
    (wcld, w) => { wcld[w] = (wcld.ContainsKey(w) ? wcld[w] : 0) + 1; return wcld; });
Even changing the foreach to use a similar set expression didn't make it faster.
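For comparison, the usual way to tighten the plain foreach itself is TryGetValue, which drops the separate ContainsKey probe (a sketch, not benchmarked here against the original):

```csharp
using System;
using System.Collections.Generic;

public class TryGetValueCount
{
    public static Dictionary<string, int> Count(IEnumerable<string> allWords)
    {
        var wordCount = new Dictionary<string, int>();
        foreach (var word in allWords)
        {
            // TryGetValue leaves count at its default of 0 on a miss, so a
            // single statement covers both the "new word" and "seen word" cases.
            wordCount.TryGetValue(word, out int count);
            wordCount[word] = count + 1;
        }
        return wordCount;
    }

    static void Main()
    {
        var counts = Count(new[] { "the", "quick", "the" });
        Console.WriteLine(counts["the"]); // prints 2
    }
}
```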
You can also solve this with the lambda (fluent) syntax; applied to the allWords sequence from the question, it looks like this:
var wordCount = allWords
    .GroupBy(w => w)
    .Select(g => new
    {
        Word = g.Key,
        Count = g.Count()
    }).ToDictionary(d => d.Word, d => d.Count);