POST Hadoop Pig output to a URL as JSON data?

I have a Pig job which analyzes log files and writes summary output to S3. Instead of writing the output to S3, I want to convert it to a JSON payload and POST it to a URL.

Some notes:

  • This job is running on Amazon Elastic MapReduce.
  • I can use a STREAM to pipe the data through an external command, and load it from there. But because Pig never sends an EOF to external commands, this means I need to POST each row as it arrives, and I can't batch them. Obviously, this hurts performance.

What's the best way to address this problem? Is there something in PiggyBank or another library that I can use? Or should I write a new storage adapter? Thank you for your advice!


Rather than streaming, you could write a UDF (since UDFs do provide a finish() callback). [1]

Another approach could be to do the POST as a second pass over the data.

  1. Your existing Pig step, which just writes a single relation out as JSON strings
  2. A simple streaming job using NLineInputFormat to do the POSTs in batches

I always favor this style of approach since it separates the concerns and keeps the Pig code clean.

It also gives you (in my mind) simpler tuning options for the POST portion of the job. In this case it's probably important to turn off speculative execution, depending on how idempotent your receiving web service is. Beware that a cluster running lots of concurrent tasks can totally kill a server too :D

E.g. for posting in batches of 20:

$ hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
  -D mapred.line.input.format.linespermap=20 \
  -D mapred.map.tasks.speculative.execution=false \
  -input json_data_to_be_posted -output output \
  -mapper your_posting_script_here.sh \
  -numReduceTasks 0 \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat
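For reference, here's a minimal sketch of what the posting script might look like; the endpoint URL is a placeholder, and it assumes each input line is already a JSON object. With NLineInputFormat, each mapper gets its 20 lines on standard input, so the script can join them into one JSON array and POST the batch in a single request:

#!/bin/bash
# Hypothetical posting script for the streaming mapper above.
# Assumes each input line is already a JSON object; the endpoint URL is made up.
ENDPOINT="http://example.com/ingest"

# Each mapper receives ~20 lines (linespermap=20) on stdin; join them with
# commas into a single JSON array so the whole batch goes out in one request.
payload="[$(paste -sd, -)]"

curl -sS -X POST \
     -H "Content-Type: application/json" \
     -d "$payload" \
     "$ENDPOINT"

If the POST fails, curl's non-zero exit code fails the map task, so Hadoop's normal retry behavior applies.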

[1] http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/EvalFunc.html#finish%28%29


Perhaps you should handle the posting of the data outside of Pig. I find that wrapping my Pig in bash is usually easier than writing a UDF for a post- (no pun intended) processing step. If you never want the output hitting S3, you can use DUMP instead of STORE and POST whatever comes out on standard output. Otherwise, store it in S3, pull it out with hadoop fs -cat outputpath/part*, then send it out with curl or something.
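
A rough sketch of that second flow; the Pig script name, output path, and endpoint URL here are placeholders, not from the answer:

#!/bin/bash
# Run the Pig job that STOREs its output, then pull the output back out and POST it.
pig -f summarize_logs.pig        # hypothetical Pig script

# Concatenate the part files and send the whole payload in one request.
# --data-binary preserves the output exactly as written, newlines included.
hadoop fs -cat outputpath/part* | \
  curl -sS -X POST \
       -H "Content-Type: application/json" \
       --data-binary @- \
       "http://example.com/ingest"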


As it turns out, Pig does correctly send EOF to external commands, so you do have the option of streaming everything through an external script. If it isn't working, then you probably have a hard-to-debug configuration problem.

Here's how to get started. Define an external command as follows, using whatever interpreter and script you need:

DEFINE UPLOAD_RESULTS `env GEM_PATH=/usr/lib/ruby/gems/1.9.0 ruby1.9 /home/hadoop/upload_results.rb`;

Stream the results through your script:

/* Write our results to our Ruby script for uploading.  We add
   a trailing bogus DUMP to make sure something actually gets run. */
empty = STREAM results THROUGH UPLOAD_RESULTS;
DUMP empty;

From Ruby, you can batch the input records into blocks of 1024:

STDIN.each_line.each_slice(1024) do |chunk|
  # 'chunk' is an array of 1024 lines, each consisting of tab-separated
  # fields followed by a newline. 
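  # Convert the batch to a JSON payload and POST it to your endpoint here.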
end

If this fails to work, check the following carefully:

  1. Does your script work from the command line?
  2. When run from Pig, does your script have all the necessary environment variables?
  3. Are your EC2 bootstrap actions working correctly?

Some of these are hard to verify, but if any of them are failing, you can easily waste quite a lot of time debugging.

Note, however, that you should strongly consider the alternative approaches recommended by mat kelcey.

