Using Pig and Python
Apologies if this question is poorly worded. I am embarking on a large-scale machine learning project and I don't like programming in Java; I love writing programs in Python. I have heard good things about Pig. Could someone clarify how usable Pig is in combination with Python for mathematical work? Also, if I write "streaming Python code", does Jython come into the picture? If it does, is it more efficient?
Thanks
P.S. For several reasons I would prefer not to use Mahout's code as is. I might want to use a few of its data structures, though, so it would be useful to know whether that is possible.
Another option to use Python with Hadoop is PyCascading. Instead of writing only the UDFs in Python/Jython, or using streaming, you can put the whole job together in Python, using Python functions as "UDFs" in the same script as where the data processing pipeline is defined. Jython is used as the Python interpreter, and the MapReduce framework for the stream operations is Cascading. The joins, groupings, etc. work similarly to Pig in spirit, so there is no surprise there if you already know Pig.
A word counting example looks like this:
# PyCascading's helpers provide Flow, Hfs, TextLine, GroupBy, Count and the decorators
from pycascading.helpers import *

@map(produces=['word'])
def split_words(tuple):
    # This is called for each line of text
    for word in tuple.get(1).split():
        yield [word]

def main():
    flow = Flow()
    input = flow.source(Hfs(TextLine(), 'input.txt'))
    output = flow.tsv_sink('output')
    # This is the processing pipeline
    input | split_words | GroupBy('word') | Count() | output
    flow.run()
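To sanity-check what the word counting pipeline computes, here is a rough plain-Python equivalent that runs locally without Hadoop or PyCascading. It is only a sketch: count_words is a hypothetical helper name, and it mirrors the split, group, and count steps on an in-memory list of lines rather than calling any Cascading API.

```python
from collections import Counter

def split_words(line):
    """Yield each whitespace-separated word in one line of text."""
    for word in line.split():
        yield word

def count_words(lines):
    """Group identical words and count them, like the GroupBy and Count steps."""
    counts = Counter()
    for line in lines:
        counts.update(split_words(line))
    return counts

if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog"]
    # Print tab-separated word/count pairs, similar in shape to a TSV sink
    for word, n in sorted(count_words(sample).items()):
        print("%s\t%d" % (word, n))
```

Running the same small input through both versions is a cheap way to confirm the cluster job behaves as expected.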
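When you use streaming in Pig, it doesn't matter what language you use. All Pig does is execute a command in a shell (for example, via bash), feeding records to it on standard input and reading results back from standard output. You can use Python, just as you can use grep or a C program. The script runs under whatever interpreter the command invokes, so Jython is not involved in streaming.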
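As a minimal sketch of the streaming approach, the script below reads tab-separated records from standard input and writes one record per line to standard output. The file name wordsplit.py, the relation names in the comment, and the assumption that the text sits in the first field are all illustrative, not taken from the question.

```python
import sys

# Hypothetical Pig Latin usage (names are illustrative):
#   DEFINE wordsplit `python wordsplit.py` SHIP('wordsplit.py');
#   words = STREAM lines THROUGH wordsplit AS (word:chararray);

def split_records(lines):
    """Turn each input record into one lowercase word per output record."""
    for line in lines:
        # Fields arrive tab-separated; this sketch assumes the text is the first field
        text = line.rstrip("\n").split("\t")[0]
        for word in text.split():
            yield word.lower()

if __name__ == "__main__":
    for word in split_records(sys.stdin):
        sys.stdout.write(word + "\n")
```

Because the script only talks to stdin and stdout, it can be tested from a shell by piping a text file into it before it ever touches the cluster.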
You can now also define Pig UDFs natively in Python. Pig runs these UDFs through Jython.
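A minimal sketch of such a UDF follows. It assumes the file is saved as udfs.py (a hypothetical name) and registered from Pig Latin with something like REGISTER 'udfs.py' USING jython AS myudfs; after which the function could be called as myudfs.normalize(name). Pig supplies the outputSchema decorator when it loads the script; the fallback definition is only there so the file can also be run with a plain Python interpreter.

```python
try:
    outputSchema  # supplied by Pig when the script is registered
except NameError:
    def outputSchema(schema):
        """Stand-in decorator for running outside Pig; it ignores the schema."""
        def decorator(func):
            return func
        return decorator

@outputSchema("normalized:chararray")
def normalize(text):
    """Trim and lowercase a chararray; pass nulls (None) through unchanged."""
    if text is None:
        return None
    return text.strip().lower()
```

Handling None explicitly matters because Pig passes null fields to the UDF as None.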
The Programming Pig book discusses using UDFs, and it is indispensable in general. On a recent project, we used Python UDFs and occasionally ran into float vs. double type mismatches, so be warned. My impression is that the support for Python UDFs may not be as solid as the support for Java UDFs, but overall it works pretty well.
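One way to reduce this kind of mismatch, as far as I understand the Jython bridge, is to declare numeric outputs as double and convert explicitly, since a Python float is a double-precision value. The sketch below rests on that assumption; the function name mean_of and the bag layout (a list of single-field tuples) are illustrative, so check them against your Pig version.

```python
try:
    outputSchema  # supplied by Pig when the script is registered
except NameError:
    def outputSchema(schema):
        """Stand-in decorator for running outside Pig; it ignores the schema."""
        def decorator(func):
            return func
        return decorator

@outputSchema("mean:double")
def mean_of(bag):
    """Average the first field of each tuple; return None for an empty bag."""
    # Convert explicitly so ints and floats are treated the same way
    values = [float(t[0]) for t in bag if t[0] is not None]
    if not values:
        return None
    return sum(values) / len(values)
```

If a downstream step really needs a Pig float, casting in the Pig script (rather than in the UDF schema) keeps the conversion visible in one place.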