I want to extend an existing clustering algorithm to cope with very large data sets and have redesigned it in such a way that it is now computable with partitions of data, which opens t
I have a User Defined Function (UDF) written in Java that parses lines in a log file and returns the information to Pig, so that Pig can do all the processing.
Apache Pig can load data from Hadoop sequence files using the PiggyBank SequenceFileLoader:
I have a Pig script that invokes a Python program. This works in my own Hadoop environment, but it always fails when I run the script on Amazon's MapReduce web service.