POST Hadoop Pig output to a URL as JSON data?
I have a Pig job that analyzes log files and writes summary output to S3. Instead of writing the output to S3, I want to convert it to a JSON payload and POST it to a URL.
Some notes:
- This job is running on Amazon Elastic MapReduce.
- I can use a STREAM to pipe the data through an external command, and load it from there. But because Pig never sends an EOF to external commands, this means I need to POST each row as it arrives, and I can't batch them. Obviously, this hurts performance.
What's the best way to address this problem? Is there something in PiggyBank or another library that I can use? Or should I write a new storage adapter? Thank you for your advice!
Rather than streaming, you could write a UDF, since UDFs provide a finish() callback [1].
Another approach could be to do the POST as a second pass over the data.
- your existing Pig step, which just writes out a single relation as JSON strings
- a simple streaming job using NLineInputFormat to do the POST in batches
I always favor this style of approach since it separates the concerns and keeps the Pig code clean.
It also gives you (in my mind) simpler tuning options for the POST portion of your job. In this case it's (probably) important to turn off speculative execution, depending on the idempotence of your receiving web service. Beware that a cluster running lots of concurrent jobs can totally kill a server too :D
E.g., for posting in batches of 20 (the job is map-only, so it is map-side speculative execution that gets disabled):
$ hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
    -D mapred.line.input.format.linespermap=20 \
    -D mapred.map.tasks.speculative.execution=false \
    -input json_data_to_be_posted -output output \
    -mapper your_posting_script_here.sh \
    -numReduceTasks 0 \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat
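The mapper in the streaming command above can be any executable, so here is a rough Ruby sketch of the payload-building half of such a script. It assumes each input record is a JSON object on its own line. As far as I know, streaming passes the input key along with the value for input formats other than the default text one, so with NLineInputFormat each line may arrive prefixed with a byte offset and a tab; the sketch strips that prefix if it is present. The names strip_offset_key and build_payload are made up for illustration.

```ruby
require "json"

# Drop a leading "<byte offset>\t" key if the line has one.
# Assumption: records are JSON objects, so they never start with digits.
def strip_offset_key(line)
  line.chomp.sub(/\A\d+\t/, "")
end

# Combine the JSON records of one map task into a single JSON array,
# ready to be sent as the body of one POST request.
def build_payload(lines)
  docs = lines.map { |line| JSON.parse(strip_offset_key(line)) }
  JSON.generate(docs)
end
```

Inside the mapper you would call build_payload(STDIN.readlines) once and POST the result, so each map task issues a single request for its 20 lines.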
[1] http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/EvalFunc.html#finish%28%29
Perhaps you should handle the posting of the data outside of Pig. I find that wrapping my Pig script in bash is usually easier than writing a UDF for a post (no pun intended) processing step. If you never want the data hitting S3, you can use dump instead of store and post whatever arrives on standard output. Otherwise, store it in S3, pull it out with hadoop fs -cat outputpath/part*, then send it out with curl or something similar.
As it turns out, Pig does correctly send EOF to external commands, so you do have the option of streaming everything through an external script. If it isn't working, then you probably have a hard-to-debug configuration problem.
Here's how to get started. Define an external command as follows, using whatever interpreter and script you need:
DEFINE UPLOAD_RESULTS `env GEM_PATH=/usr/lib/ruby/gems/1.9.0 ruby1.9 /home/hadoop/upload_results.rb`;
Stream the results through your script:
/* Write our results to our Ruby script for uploading. We add
a trailing bogus DUMP to make sure something actually gets run. */
empty = STREAM results THROUGH UPLOAD_RESULTS;
DUMP empty;
From Ruby, you can batch the input records into blocks of 1024:
STDIN.each_line.each_slice(1024) do |chunk|
  # 'chunk' is an array of up to 1024 lines (the last chunk may be shorter),
  # each consisting of tab-separated fields followed by a newline.
end
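To fill in the body of that loop, something along these lines might work. It is only a sketch: it turns each chunk into a JSON array of arrays (one inner array per row) and POSTs it with Net::HTTP. The UPLOAD_URL environment variable and the helper names batch_to_json and post_json are assumptions of mine, not anything Pig or EMR provides, and the JSON shape your endpoint expects may well differ.

```ruby
require "json"
require "net/http"
require "uri"

# Turn one chunk of tab-separated lines into a JSON array of arrays.
# The -1 limit keeps trailing empty fields instead of dropping them.
def batch_to_json(lines)
  rows = lines.map { |line| line.chomp.split("\t", -1) }
  JSON.generate(rows)
end

# POST one JSON payload and raise if the server does not answer 2xx,
# so a failed upload makes the streaming command exit with an error.
def post_json(url, payload)
  uri = URI.parse(url)
  request = Net::HTTP::Post.new(uri.request_uri, "Content-Type" => "application/json")
  request.body = payload
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  raise "POST failed with status #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  response
end

# UPLOAD_URL is a hypothetical variable; nothing is read or sent without it.
if ENV["UPLOAD_URL"]
  STDIN.each_line.each_slice(1024) do |chunk|
    post_json(ENV["UPLOAD_URL"], batch_to_json(chunk))
  end
end
```

The variable could be passed the same way GEM_PATH is passed in the DEFINE statement above, by adding it to the env prefix of the command.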
If this fails to work, check the following carefully:
- Does your script work from the command line?
- When run from Pig, does your script have all the necessary environment variables?
- Are your EC2 bootstrap actions working correctly?
Some of these are hard to verify, but if any of them are failing, you can easily waste quite a lot of time debugging.
Note, however, that you should strongly consider the alternative approaches recommended by mat kelcey.