Combiner function in Python Hadoop streaming
I have a mapper that outputs key/value pairs, which are sorted and piped into reducer.py.
Since the keys are already sorted before they reach the reducer, I want to write a combiner that iterates through the sorted stream and outputs (key, [v1, v2, v3]) pairs for the reducer to consume.
cat data | python mapper.py | sort | python reducer.py
What's the best way to write the reducer so that I don't have to keep a dictionary of all keys, which would take a lot of memory to hold every entry?
Use itertools.groupby:
>>> import itertools
>>> import operator
>>> foo = [("a", 1), ("a", 2), ("b", 1), ("c", 1), ("c", 2)]
>>> for key, group in itertools.groupby(foo, operator.itemgetter(0)):
...     print(key, list(map(operator.itemgetter(1), group)))
...
a [1, 2]
b [1]
c [1, 2]
Explanation:
groupby, as the name suggests, groups the elements of an iterable into chunks based on some key function. That is, it calls keyfunc on the first element of the iterable, then pulls elements one by one until the value of keyfunc changes, at which point it yields all the elements it has collected so far and starts again with the new key. It is also memory-efficient: once values have been yielded, they are no longer held by groupby.
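One consequence of this streaming behavior, worth noting as an aside: each group is a lazy iterator backed by the same underlying stream, so it must be consumed before advancing to the next group. A small illustration:

```python
import itertools
import operator

pairs = [("a", 1), ("a", 2), ("b", 3)]
groups = itertools.groupby(pairs, operator.itemgetter(0))

key, values = next(groups)  # first group: key "a"
next(groups)                # advancing invalidates the previous group's iterator
print(list(values))         # [] -- "a"'s values are gone once we have moved on
```

So in a combiner, consume (or materialize) each group's values inside the loop body before the loop moves to the next key.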
Here, we group the elements of the input by operator.itemgetter(0), a handy "toolbox" function that maps x to x[0]. In other words, we group by the first element of each tuple, which is the key.
Naturally, you will need to write a custom generator that reads the input (probably from sys.stdin) and yields the pairs one by one. Fortunately, this is easy with the yield keyword.
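Putting the pieces together, here is a minimal sketch of such a combiner (the script name combiner.py and the tab-separated line format are assumptions; tab is Hadoop streaming's default key/value separator):

```python
import sys
import itertools
import operator


def read_pairs(stream):
    """Yield (key, value) tuples from tab-separated lines, one at a time."""
    for line in stream:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value


def main(stream=sys.stdin):
    # groupby merges consecutive pairs with the same key;
    # this only works because the input is already sorted by key.
    for key, group in itertools.groupby(read_pairs(stream), operator.itemgetter(0)):
        values = [value for _, value in group]
        print("%s\t%s" % (key, values))


if __name__ == "__main__":
    main()
```

It would then slot into the pipeline as `cat data | python mapper.py | sort | python combiner.py | python reducer.py`, holding only one key's values in memory at a time.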
Note also that this assumes the keys are sorted. If they are not, there is nothing you can do: you would need to read to the end of the input to be sure you have all the values for a given key.
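To see why sortedness matters: groupby only merges *consecutive* equal keys, so an unsorted input splits a key across multiple groups. A quick illustration:

```python
import itertools
import operator

# "a" appears twice, but not consecutively, so it yields two separate groups.
unsorted_pairs = [("a", 1), ("b", 2), ("a", 3)]
keys = [key for key, _ in itertools.groupby(unsorted_pairs, operator.itemgetter(0))]
print(keys)  # ['a', 'b', 'a'] -- "a" is split into two groups
```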