How do I store gzipped files using PigStorage in Apache Pig?

Apache Pig v0.7 can read gzipped files with no extra effort on my part, e.g.:

MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);

I can process that data and output it to disk okay:

PerUser = GROUP MyData BY user;
UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;
STORE UserCount INTO '/tmp/usercount' USING PigStorage(',');

But the output file isn't compressed:

/tmp/usercount/part-r-00000

Is there a way of telling the STORE command to output content in gzip format? Note that ideally I'd like an answer applicable for Pig 0.6 as I wish to use Amazon Elastic MapReduce; but if there's a solution for any version of Pig I'd like to hear it.


There are two ways:

  1. As mentioned in another answer, give the output directory in the STORE statement a .gz extension, e.g. usercount.gz:

    STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');

  2. Set the compression method in your script:

    set output.compression.enabled true;
    set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;


For Pig r0.8.0 the answer is as simple as giving your output path an extension of ".gz" (or ".bz" should you prefer bzip).

The last line of your code should be amended to read:

STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');

Per your example, your output file would then be found as

/tmp/usercount.gz/part-r-00000.gz
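
One way to check the result is to load the compressed directory back into Pig, since (as the question notes) Pig reads gzipped input without extra effort. This is only a sketch: the alias Check and the field types are assumptions for illustration, and it has not been verified against every Pig version.

```pig
-- Hypothetical check: read the gzipped output back and print it.
-- The alias "Check" and the declared types are assumptions, not part of the original script.
Check = LOAD '/tmp/usercount.gz' USING PigStorage(',') AS (user:chararray, count:long);
DUMP Check;
```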

For more information, see: https://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#PigStorage


According to the Pig documentation for PigStorage, there are two ways to do this:

Specifying the compression format using the 'STORE' statement

STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');
STORE UserCount INTO '/tmp/usercount.bz2' USING PigStorage(',');
STORE UserCount INTO '/tmp/usercount.lzo' USING PigStorage(',');

Note the file extensions in the statements above. Pig supports three compression formats: gzip, bzip2, and LZO. To get LZO working you have to install it separately. See here for more information about LZO.

Specifying compression via job properties

Set the properties output.compression.enabled and output.compression.codec in your Pig script, using the following code:

set output.compression.enabled true;

and exactly one of the following lines, depending on the codec you want:

set output.compression.codec com.hadoop.compression.lzo.LzopCodec;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;
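
Put together with the script from the question, the property-based approach might look like the sketch below. It assumes a Pig version whose set command accepts these properties, which this thread does not confirm for Pig 0.6. Gzip is chosen here only as an example codec.

```pig
-- Turn on output compression for every STORE in this script.
set output.compression.enabled true;
-- Pick exactly one codec; gzip is used here as an example.
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

-- The rest is the script from the question, unchanged.
MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);
PerUser = GROUP MyData BY user;
UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;
STORE UserCount INTO '/tmp/usercount' USING PigStorage(',');
```

With this approach the output path does not need a .gz extension, because the codec comes from the properties rather than from the directory name.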