Improve Efficiency for This Text Processing Code
I am writing a program that counts the words in a text file that is already lowercase and space-separated. I want to use a dictionary and only count a word if it is in the dictionary. The problem is that the dictionary is quite large (~100,000 words) and each text document also has ~50,000 words. As a result, the code I wrote below is very slow (it takes about 15 seconds to process one document on a quad-core i7 machine). I'm wondering whether there's something wrong with my code and whether its efficiency can be improved. Thanks so much for your help. Code below:
public static string WordCount(string countInput)
{
    string[] keywords = ReadDic(); /* read dictionary txt file */
    /* then read the main text file */
    Dictionary<string, int> dict = ReadFile(countInput).Split(' ')
        .Select(c => c)
        .Where(c => keywords.Contains(c))
        .GroupBy(c => c)
        .Select(g => new { word = g.Key, count = g.Count() })
        .OrderBy(g => g.word)
        .ToDictionary(d => d.word, d => d.count);
    int s = dict.Sum(e => e.Value);
    string k = s.ToString();
    return k;
}
You can vastly improve performance by reading the text file one line at a time instead of building one enormous string. You can call

File.ReadLines(path).SelectMany(s => s.Split(' '))

Do not call ReadAllLines; it would need to build an enormous array.
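A minimal sketch of the streaming approach (names like CountKnownWords and textPath are illustrative, not from the original post; the keywords set is assumed to have been built already):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static int CountKnownWords(string textPath, HashSet<string> keywords)
{
    return File.ReadLines(textPath)                     // lazy: yields one line at a time
               .SelectMany(line => line.Split(' '))     // flatten lines into a word stream
               .Count(word => keywords.Contains(word)); // O(1) average membership test
}
```

Because File.ReadLines is lazy, only one line is held in memory at a time, so the whole document is never materialized as a single string or array.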
Your first Select call is utterly useless.
Your Contains call will loop through the entire dictionary for each word in the file. Thus, the Where call is an O(n²) operation.
Change keywords to a HashSet<string>. Since a HashSet can be searched in constant time, the Where call becomes an O(n) operation, which is much better.
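For example (a one-line sketch; ReadDic is the asker's own helper, assumed to return the dictionary words as a string[]):

```csharp
using System.Collections.Generic;

// Build the set once. HashSet<string>.Contains is O(1) on average,
// versus a linear scan for Enumerable.Contains over a string[].
HashSet<string> keywords = new HashSet<string>(ReadDic());
```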
Your second Select call can be combined with the GroupBy, which will cut a large number of object allocations:
.GroupBy(c => c, (word, set) => new { word, count = set.Count() })
Dictionaries are intrinsically unordered, so your OrderBy call is a useless waste of time.
Since you're running on a quad-core machine, you could at least throw an AsParallel() in there.
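One possible placement (a sketch, assuming keywords has already been converted to a HashSet<string> as suggested above; the rest of the query follows the original):

```csharp
using System.Linq;

// AsParallel partitions the word sequence across cores, so the Where
// filter and the grouping run in parallel. Group order is not preserved,
// which is fine here since the result goes into an unordered Dictionary.
var dict = ReadFile(countInput).Split(' ')
    .AsParallel()
    .Where(c => keywords.Contains(c))
    .GroupBy(c => c, (word, set) => new { word, count = set.Count() })
    .ToDictionary(d => d.word, d => d.count);
```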
As far as I can see, all your code can be replaced with:
return ReadFile(countInput).Split(' ').Count(c => keywords.Contains(c));
And, as SLaks said, a HashSet would improve performance.
One more improvement: if you call this code in a loop, you shouldn't call ReadDic() on each iteration; load the dictionary once and pass it as a parameter.
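A sketch of that refactoring (documentPaths is an illustrative name; ReadDic and ReadFile are the asker's own helpers):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Load the dictionary once, outside the loop, and reuse the set.
HashSet<string> keywords = new HashSet<string>(ReadDic());

foreach (string path in documentPaths)
{
    Console.WriteLine(WordCount(path, keywords)); // pass the prebuilt set in
}

// WordCount now takes the set instead of rereading the dictionary file:
public static string WordCount(string countInput, HashSet<string> keywords)
{
    int s = ReadFile(countInput).Split(' ').Count(c => keywords.Contains(c));
    return s.ToString();
}
```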
Try changing string[] keywords to HashSet<string> keywords. Your call to Contains is essentially a loop, which is going to be vastly slower than a lookup by hash key.
If you want to get REALLY fancy, you could make use of multiple threads with some PLINQ, but I would make sure you've optimized your single-threaded performance before going that route.