
Number of results needed from each node in parallel processing

I have a dataset (in the form of a file) composed of lines of words. I want to find the 20 most frequently occurring words. This is a huge dataset, so I am processing it in parallel: I would partition the dataset into subsets and feed one subset to each parallel worker. Each worker would then count each word in its subset and return a list of the most frequent words with their counts. All of those lists would then be aggregated, and the top 20 most frequent words over the entire dataset would be compiled from the workers' results.
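Roughly, each worker would do something like this (a rough Python sketch of what I have in mind; `k` is the number of pairs I am asking about, and the function names are just placeholders):

```python
from collections import Counter
from multiprocessing import Pool

def count_chunk(args):
    """Worker: count words in its subset and return only its k most frequent."""
    lines, k = args
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts.most_common(k)   # how large does k have to be?

def top_20(filename, workers=4, k=20):
    with open(filename) as f:
        lines = f.readlines()
    # Partition the dataset: one subset of lines per worker.
    chunks = [(lines[i::workers], k) for i in range(workers)]
    with Pool(workers) as pool:
        partial_results = pool.map(count_chunk, chunks)
    # Aggregate the partial lists and take the overall top 20.
    total = Counter()
    for pairs in partial_results:
        total.update(dict(pairs))
    return total.most_common(20)
```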

How many word/count pairs does each worker need to return to the aggregator in order to guarantee that I will get the top 20 words out of the entire dataset?


You need to keep processing words until the difference between the counts of the 20th and 21st most frequent words is greater than the number of unprocessed words remaining.

If you need to rank the top 20 most frequent words, then you need to process everything.
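To illustrate that stopping condition, here is a rough single-process sketch in Python (assuming the words are available as one in-memory list; the names are only illustrative):

```python
from collections import Counter

def top20_with_early_stop(words):
    """Count words one at a time; stop once the top-20 membership can no longer change."""
    counts = Counter()
    remaining = len(words)
    for word in words:
        counts[word] += 1
        remaining -= 1
        ranked = counts.most_common(21)
        # If the 20th place leads the 21st by more than the words still unread,
        # no other word can catch up, so the top-20 set is already fixed.
        if len(ranked) > 20 and ranked[19][1] - ranked[20][1] > remaining:
            return [w for w, _ in ranked[:20]]
    return [w for w, _ in counts.most_common(20)]
```

Note that this only fixes which words are in the top 20; their relative order can still change, which is why a full ranking requires processing everything.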


It's impossible to give a fixed number. Consider, for example, the possibility that you have four workers and worker 1's word frequencies are the inverse of the other three workers' frequencies. In order to get an accurate count, worker 1 will have to return its entire set of counts so that it can be aggregated.

In the general case, each worker has to return all of its data to the master node for aggregation.

What you're doing is a pretty typical map and reduce problem, a good discussion of which can be found at MapReduce or Map/Reduce - A visual explanation.
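As a rough Python sketch of that structure (the chunking and function names are illustrative, not tied to any particular framework): each map step returns its complete word counts rather than a truncated list, and the reduce step merges them before taking the top 20.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_count(lines):
    """Map step: each worker returns its complete word counts, not a truncated list."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_top20(filename, workers=4):
    with open(filename) as f:
        lines = f.readlines()
    chunks = [lines[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(map_count, chunks)
    # Reduce step: merge the full counts, then take the global top 20.
    total = reduce(lambda a, b: a + b, partials, Counter())
    return total.most_common(20)
```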

