solving a problem with map reduce
I want to simulate in ruby my implementation of the map
and reduce
functions for a system like hadoop to verify that the idea works at least.
I have the following problem. I have two lists of e开发者_JAVA百科lements:
List1
3 - A
4 - B
5 - C
7 - D
8 - F
List2
2 - A
8 - B
6 - C
9 - D
4 - E
I need to build a common list that includes the sum of the numbers associated with the alphabets common in the two lists:
commonList
5 - A
12 - B
11 - C
16 - D
I want to make a ruby script with the map
and reduce
operations to solve this problem. I am unsure how to tackle on this problem or what procedure to follow to simulate this in a ruby script.
Any help appreciated.
Using irb (ruby-1.9.2-p180):
list = [ {a:2, b:1, d:3}, {a:3, b:2, c:3}, {a:4, b:1, c:3} ]
=> [{:a=>2, :b=>1, :d=>3}, {:a=>3, :b=>2, :c=>3}, {:a=>4, :b=>1, :c=>3}]
Hash[list.map(&:keys).inject(&:&).map{|key| [key,list.map{|arr| arr[key]}.inject(&:+)]}]
=> {:a=>9, :b=>4}
this solution works with multiple arrays (2+) it finds common keys and sums them returning a hash of results
to find common keys (gather keys and find common part):
list.map(&:keys).inject(&:&)
to find sum for key (select values by keys and sum them):
list.map{|arr| arr[key]}.inject(&:+)
to build Hash from array of pairs [[:a,9], [:b,4]]
:
results = [[:a,9], [:b,4]]
Hash[ results ]
I love ruby for this one liners!
You could try by considering the elements given in MapReduce wikipedia article:
- an input reader - in your case this would probably be a method call on
[key, value]
pair from your input hashes. - a Map function - you already have keys you should be processing your data by, so your
map
worker would just return the[key, value]
pair it got as an input - a partition function - a method which would assign a reduce worker based on the key. In your case it could be simply
key.hash % REDUCER_COUNT
. - a compare function - I don't think this is applicable in your case as you don't need values to be processed in any particular order.
- a Reduce function - would be given
[key, list]
pair, list being list of values associated with the key. It would return the sum oflist
if list is more than one element long (as you want only elements appearing in both input hashes processed). - an output writer - could be plain Hash in your example.
And here's my (over)simplified implementation of the above.
Assuming that we have all the other map-reduce-related functions implemented (input reader, output writer, global sort, ...), these would be the map
and reduce
ones:
def map(input)
input.each do |count, letter|
yield [letter, count]
end
end
def reduce(letter, partial_counts)
result = if partial_counts.size == 2
partial_counts[0] + partial_counts[1]
end
yield result
end
The map
function will yield
a pair (letter, count)
, which will be grouped later. Then for each letter
received from map
s reduce
will receive an array containing every count yielded by a map
for that letter
. As you wish to only yield if the letter occurs on both hashes, we need count
s to appear on partial_counts
twice to use it to compute the sum at the end. The reduce
function could be implemented in several ways. I've tried to make it as simples as possible to understand, although it's implementation is very adjusted to this problem.
Using these map
and reduce
implementation will return the last hash with keys and value inverted, which makes more sense, as there may be several letter with the same count. The input would be better if it inverted keys and values too. This way, map
would be as simple as yielding each pair of (letter, count)
:
def map(input)
input.each do |letter, count|
yield [letter, count]
end
end
or
def map(input)
input.each do |i|
yield i
end
end
list_1 = ["3 - A", "4 - B", "5 - C", "7 - D", "8 - F"]
list_2 = ["2 - A", "8 - B", "6 - C", "9 - D", "4 - E"]
(list_1 + list_2).map do |str|
# change array of strings to array in the form of [[name, value], ...]
str =~ /(\d+) - (.*)/ && [$2, $1.to_i]
end.reduce({}) do |memo, obj|
# use a temporary Hash to sum up the values;
# the value is an array in the form of [value_counter, iteration_counter]
prev = memo[obj.first] || [0, 0]
memo[obj.first] = [prev.first + obj.last, prev.last + 1]
memo
end.map do |key, value|
# convert to array in original format or
# nil, if occurred only once
value.last > 1 ? "#{key} - #{value.first}" : nil
end.compact
=> ["A - 5", "B - 12", "C - 11", "D - 16"]
This code uses the map
and reduce
methods of Ruby, but doing all this directly on a Hash would be much more elegant.
精彩评论