开发者

solving a problem with map reduce

I want to simulate in ruby my implementation of the map and reduce functions for a system like hadoop to verify that the idea works at least.

I have the following problem. I have two lists of e开发者_JAVA百科lements:

List1
3 - A
4 - B
5 - C
7 - D
8 - F

List2
2 - A
8 - B
6 - C
9 - D
4 - E

I need to build a common list that includes the sum of the numbers associated with the alphabets common in the two lists:

commonList
5 - A
12 - B
11 - C
16 - D

I want to make a ruby script with the map and reduce operations to solve this problem. I am unsure how to tackle on this problem or what procedure to follow to simulate this in a ruby script.

Any help appreciated.


Using irb (ruby-1.9.2-p180):

list = [ {a:2, b:1, d:3}, {a:3, b:2, c:3}, {a:4, b:1, c:3} ]
 => [{:a=>2, :b=>1, :d=>3}, {:a=>3, :b=>2, :c=>3}, {:a=>4, :b=>1, :c=>3}] 

Hash[list.map(&:keys).inject(&:&).map{|key| [key,list.map{|arr| arr[key]}.inject(&:+)]}]
 => {:a=>9, :b=>4} 

this solution works with multiple arrays (2+) it finds common keys and sums them returning a hash of results

to find common keys (gather keys and find common part):

list.map(&:keys).inject(&:&)

to find sum for key (select values by keys and sum them):

list.map{|arr| arr[key]}.inject(&:+)

to build Hash from array of pairs [[:a,9], [:b,4]]:

results = [[:a,9], [:b,4]]
Hash[ results ]

I love ruby for this one liners!


You could try by considering the elements given in MapReduce wikipedia article:

  • an input reader - in your case this would probably be a method call on [key, value] pair from your input hashes.
  • a Map function - you already have keys you should be processing your data by, so your map worker would just return the [key, value] pair it got as an input
  • a partition function - a method which would assign a reduce worker based on the key. In your case it could be simply key.hash % REDUCER_COUNT.
  • a compare function - I don't think this is applicable in your case as you don't need values to be processed in any particular order.
  • a Reduce function - would be given [key, list] pair, list being list of values associated with the key. It would return the sum of list if list is more than one element long (as you want only elements appearing in both input hashes processed).
  • an output writer - could be plain Hash in your example.

And here's my (over)simplified implementation of the above.


Assuming that we have all the other map-reduce-related functions implemented (input reader, output writer, global sort, ...), these would be the map and reduce ones:

def map(input)
  input.each do |count, letter|
    yield [letter, count]
  end
end

def reduce(letter, partial_counts)
  result = if partial_counts.size == 2
    partial_counts[0] + partial_counts[1]
  end

  yield result
end

The map function will yield a pair (letter, count), which will be grouped later. Then for each letter received from maps reducewill receive an array containing every count yielded by a map for that letter. As you wish to only yield if the letter occurs on both hashes, we need counts to appear on partial_counts twice to use it to compute the sum at the end. The reduce function could be implemented in several ways. I've tried to make it as simples as possible to understand, although it's implementation is very adjusted to this problem.

Using these map and reduce implementation will return the last hash with keys and value inverted, which makes more sense, as there may be several letter with the same count. The input would be better if it inverted keys and values too. This way, map would be as simple as yielding each pair of (letter, count):

def map(input)
  input.each do |letter, count|
    yield [letter, count]
  end
end

or

def map(input)
  input.each do |i|
    yield i
  end
end


list_1 = ["3 - A", "4 - B", "5 - C", "7 - D", "8 - F"]

list_2 = ["2 - A", "8 - B", "6 - C", "9 - D", "4 - E"]

(list_1 + list_2).map do |str|
  # change array of strings to array in the form of [[name, value], ...]
  str  =~ /(\d+) - (.*)/ && [$2, $1.to_i]
end.reduce({}) do |memo, obj|
  # use a temporary Hash to sum up the values;
  # the value is an array in the form of [value_counter, iteration_counter]
  prev = memo[obj.first] || [0, 0]
  memo[obj.first] = [prev.first + obj.last, prev.last + 1]
  memo
end.map do |key, value|
  # convert to array in original format or
  # nil, if occurred only once
  value.last > 1 ? "#{key} - #{value.first}" : nil
end.compact

=> ["A - 5", "B - 12", "C - 11", "D - 16"]

This code uses the map and reduce methods of Ruby, but doing all this directly on a Hash would be much more elegant.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜