Given a file, find the ten most frequently occurring words as efficiently as possible
This is apparently an interview question (I found it in a collection of interview questions), but even if it's not, it's pretty cool.
We are told to do this efficiently on all complexity measures. I thought of creating a HashMap that maps the words to their frequency. That would be O(n) in time and space complexity, but since there may be lots of words we cannot assume that we can store everything in memory.
I must add that nothing in the question says that the words cannot be stored in memory, but what if that were the case? If that's not the case, then the question does not seem as challenging.
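Here is a minimal Python sketch of the counting idea I have in mind, assuming the distinct words do fit in memory (the function name and the use of Counter are just my choices):

from collections import Counter

# One O(n) counting pass, then select the ten largest counts.
# Assumes the set of distinct words fits in memory.
def top_ten(path):
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts.update(line.split())
    return counts.most_common(10)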
Optimizing for my own time:
sort file | uniq -c | sort -nr | head -10
Possibly followed by awk '{print $2}' to eliminate the counts.
I think a trie is a good choice here.
In the trie, each node records the count of the word spelled by the characters on the path from the root to that node.
The time complexity to set up the trie is O(Ln) ~ O(n) (where L is the number of characters in the longest word, which we can treat as a constant). To find the top 10 words, we can traverse the trie, which also costs O(n). So it takes O(n) to solve this problem.
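A rough Python sketch of this idea; the node layout and the heap-based traversal are my own choices, not something prescribed by the approach:

import heapq

class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child node
        self.count = 0       # how many times the word ending here occurs

def top_ten_with_trie(words):
    root = TrieNode()
    for word in words:                       # building the trie is O(L*n)
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1
    heap = []                                # min-heap of the best 10 seen so far
    stack = [("", root)]
    while stack:                             # visit every node once
        prefix, node = stack.pop()
        if node.count:
            heapq.heappush(heap, (node.count, prefix))
            if len(heap) > 10:
                heapq.heappop(heap)
        for ch, child in node.children.items():
            stack.append((prefix + ch, child))
    return sorted(heap, reverse=True)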
A complete solution would be something like this:
- Do an external sort O(N log N)
- Count the word freq in the file O(N)
- (An alternative would be to use a trie, as @Summer_More_More_Tea suggests, to count the frequencies, if you can afford that amount of memory) O(k*N) // for the first two steps
- Use a min-heap:
- Put the first n elements on the heap
- For every remaining word, add it to the heap and delete the new minimum in the heap
- In the end the heap will contain the n most common words O(|words|*log(n))
With the trie the cost would be O(k*N), because the total number of words is generally bigger than the size of the vocabulary. Finally, since k is smaller for most western languages, you could assume linear complexity.
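A small Python sketch of the min-heap step (function and variable names are mine); it consumes (word, count) pairs produced by whichever counting phase you picked:

import heapq

# Keep the n most frequent entries from a stream of (word, count) pairs.
def top_n(pairs, n=10):
    heap = []                            # min-heap of (count, word)
    for word, count in pairs:
        if len(heap) < n:
            heapq.heappush(heap, (count, word))
        elif count > heap[0][0]:         # beats the current minimum
            heapq.heapreplace(heap, (count, word))
    return sorted(heap, reverse=True)    # O(|words| * log n) overall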
I have done it in C# like this (a sample):
int wordFrequency = 10;
string words = "hello how r u u u u u u u u u u u u u u u u u u ? hello there u u u u ! great to c u there. hello .hello hello hello hello hello .hello hello hello hello hello hello ";
var result = (from word in words.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries)
              group word by word into g
              select new { Word = g.Key, Occurance = g.Count() })
             .ToList()
             .FindAll(i => i.Occurance >= wordFrequency);
Let's say we assign a random prime number to each of the 26 letters of the alphabet. Then we scan the file. Whenever we find a word, we calculate its hash value (a formula based on the positions and values of the letters making up the word). If we find this value in the hash table, then we know for sure that we are not encountering it for the first time, so we increment its count, and we maintain an array of at most 10 entries for the current top words. But if we encounter a new hash, then we store the file pointer for that hash value and initialize the count to 0.
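A rough Python illustration of that scheme; the hash formula is my own guess, the primes are fixed rather than random, the file-pointer bookkeeping is omitted, and such a hash can collide, so treat this as a thought experiment only:

# One prime per letter (fixed here for simplicity instead of random).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def word_hash(word):
    # Combine each letter's prime with its position in the word.
    return sum(PRIMES[ord(c) - ord('a')] * (i + 1)
               for i, c in enumerate(word.lower()) if 'a' <= c <= 'z')

def count_by_hash(words, n=10):
    counts = {}                          # hash value -> occurrence count
    for w in words:
        h = word_hash(w)
        counts[h] = counts.get(h, 0) + 1
    # The original idea also stores a file pointer per hash so the actual
    # word can be recovered; only the counting part is shown here.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]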
I think this is a typical application of counting sort since the sum of occurrences of each word is equal to the total number of words. A hash table with a counting sort should do the job in a time proportional to the number of words.
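A Python sketch of the counting-sort flavour of this: bucket the words by their count, which can never exceed the total number of words, then scan the buckets from the largest count down (names are mine):

from collections import Counter

def top_ten_counting_sort(words):
    counts = Counter(words)                    # hash table of frequencies
    total = sum(counts.values())               # no single count can exceed this
    buckets = [[] for _ in range(total + 1)]   # index = frequency
    for word, c in counts.items():
        buckets[c].append(word)
    result = []
    for c in range(total, 0, -1):              # highest bucket first
        for word in buckets[c]:
            result.append((word, c))
            if len(result) == 10:
                return result
    return result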
You could make a time/space tradeoff and go O(n^2) for time and O(1) for (memory) space by counting how many times a word occurs each time you encounter it in a linear pass of the data. If the count is above the top 10 found so far, then keep the word and the count, otherwise ignore it.
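A Python sketch of that tradeoff, assuming the input is a file that can simply be re-read (the names are mine):

# For each word, rescan the whole file to count it: O(n^2) time, while the
# only extra state is the running top-ten list, i.e. O(1) space.
def top_ten_quadratic(path):
    top = []                                   # at most ten (count, word) pairs
    with open(path) as f:
        for line in f:
            for word in line.split():
                if any(w == word for _, w in top):
                    continue                   # already counted and still in the top ten
                count = 0
                with open(path) as g:          # second pass just for this word
                    for other in g:
                        count += other.split().count(word)
                top = sorted(top + [(count, word)], reverse=True)[:10]
    return top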
This page says building a hash and sorting the values is best, and I'm inclined to agree: http://www.allinterview.com/showanswers/56657.html
Here is a Bash implementation that does something similar, I think: http://www.commandlinefu.com/commands/view/5994/computes-the-most-frequent-used-words-of-a-text-file
Depending on the size of the input data, it may or may not be a good idea to keep a HashMap. Say for instance, our hash-map is too big to fit into main memory. This can cause a very high number of memory transfers as most hash-map implementations need random access and would not be very good on the cache.
In such cases sorting the input data would be a better solution.
Cycle through the string of words and store each in a dictionary (using Python), with the number of times it occurs as the value.
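Roughly like this, as a literal sketch of that description:

def count_words(text):
    counts = {}
    for word in text.split():                   # store each word in a dictionary
        counts[word] = counts.get(word, 0) + 1  # value = number of occurrences
    # sort by count, descending, and keep the ten most frequent
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]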
If the word list will not fit in memory, you can split the file until it will. Generate a histogram of each part (either sequentially or in parallel), and merge the results (the details of which may be a bit fiddly if you want guaranteed correctness for all inputs, but should not compromise the O(n) effort, or the O(n/k) time for k tasks).
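A rough Python sketch of that split-and-merge idea; the part paths and pool size are placeholders, the parts are counted in worker processes, and this simple merge assumes the combined histogram fits in memory:

from collections import Counter
from multiprocessing import Pool

def count_part(path):
    # Histogram of one part of the split file.
    c = Counter()
    with open(path) as f:
        for line in f:
            c.update(line.split())
    return c

def top_ten_parallel(part_paths, workers=4):
    # Run under an "if __name__ == '__main__':" guard when using multiprocessing.
    with Pool(workers) as pool:                 # k tasks, roughly O(n/k) wall time
        partials = pool.map(count_part, part_paths)
    merged = Counter()
    for partial in partials:                    # merge the per-part histograms
        merged.update(partial)
    return merged.most_common(10)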
A Radix tree or one of its variations will generally allow you to save storage space by collapsing common sequences.
Building it will take O(nk) - where k is "the maximum length of all strings in the set".
step 1: If the file is very large and can't be sorted in memory, you can split it into chunks that can be sorted in memory.
step 2: For each sorted chunk, compute sorted pairs of (word, nr_occurrence); at this point you can discard the chunks because you only need the sorted pairs.
step 3: Iterate over the chunks, merge the sorted pairs, and always keep the top ten appearances.
Example:
Step 1:
a b a ab abb a a b b c c ab ab
split into :
chunk 1: a b a ab
chunk 2: abb a a b b
chunk 3: c c ab ab
Step 2:
chunk 1: a2, b1, ab1
chunk 2: a2, b2, abb1
chunk 3: c2, ab2
Step 3(merge the chunks and keep the top ten appearances):
a4 b3 ab3 c2 abb1
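A Python sketch of steps 2 and 3: each chunk contributes a list of (word, count) pairs sorted by word, and merging the sorted streams groups equal words together so their totals can be summed without loading every chunk at once (names are mine):

import heapq
from itertools import groupby

def top_ten_from_chunks(chunk_pair_lists, n=10):
    merged = heapq.merge(*chunk_pair_lists)          # streams stay sorted by word
    best = []                                        # min-heap of (total, word)
    for word, pairs in groupby(merged, key=lambda p: p[0]):
        total = sum(count for _, count in pairs)     # sum this word across chunks
        if len(best) < n:
            heapq.heappush(best, (total, word))
        elif total > best[0][0]:
            heapq.heapreplace(best, (total, word))
    return sorted(best, reverse=True)

# For the example above:
# chunks = [[("a", 2), ("ab", 1), ("b", 1)],
#           [("a", 2), ("abb", 1), ("b", 2)],
#           [("ab", 2), ("c", 2)]]
# top_ten_from_chunks(chunks) -> [(4, 'a'), (3, 'b'), (3, 'ab'), (2, 'c'), (1, 'abb')]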
// Count each word's occurrences with two nested loops (O(m^2) comparisons).
// h is assumed to be the input text; here it is just a small sample string.
string h = "hello how are you you you hello";
string[] stringList = h.Split(" ".ToCharArray(),
                              StringSplitOptions.RemoveEmptyEntries);
int m = stringList.Length;
for (int j = 0; j < m; j++)
{
    int c = 0;
    for (int k = 0; k < m; k++)
    {
        if (string.Compare(stringList[j], stringList[k]) == 0)
        {
            c = c + 1;
        }
    }
    Console.WriteLine("{0}: {1}", stringList[j], c); // word and its count
}
Not the most efficient CPU-wise, and UGLY, but it took only 2 minutes to bang out:
perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a}} keys %h) {print "$h{$w}\t$w"}}' file | head
- Loop over each line with -n
- Split each line into @F words with -a
- Each $_ word increments hash %h
- Once the END of file has been reached, sort the hash by the frequency
- Print the frequency $h{$w} and the word $w
- Use bash head to stop at 10 lines
Using the text of this web page as input:
121 the
77 a
48 in
46 to
44 of
39 at
33 is
30 vote
29 and
25 you
I benchmarked this solution vs the top-rated shell solution (ben jackson) on a 3.3GB text file with 580,000,000 words.
Perl 5.22 completed in 171 seconds, while the shell solution completed in 474 seconds.