How do I store gzipped files using PigStorage in Apache Pig?
Apache Pig v0.7 can read gzipped files with no extra effort on my part, e.g.:
MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);
I can process that data and output it to disk okay:
PerUser = GROUP MyData BY user;
UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;
STORE UserCount INTO '/tmp/usercount' USING PigStorage(',');
But the output file isn't compressed:
/tmp/usercount/part-r-00000
Is there a way of telling the STORE command to output content in gzip format? Note that ideally I'd like an answer applicable to Pig 0.6, as I wish to use Amazon Elastic MapReduce; but if there's a solution for any version of Pig, I'd like to hear it.
There are two ways:
1. Give the output directory a ".gz" extension in the STORE statement:
STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');
2. Set the compression method in your script:
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
For Pig r0.8.0 the answer is as simple as giving your output path an extension of ".gz" (or ".bz" should you prefer bzip).
The last line of your code should be amended to read:
STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');
Per your example, your output file would then be found at:
/tmp/usercount.gz/part-r-00000.gz
For more information, see: https://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#PigStorage
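One way to confirm the result is to load the output directory back in, since (as the question notes) Pig reads gzipped files without extra effort. This is only a sketch: the alias Check and the field names and types in the schema are hypothetical, and it assumes the STORE above has already completed on a Pig version where the extension approach works.

```pig
-- Hypothetical check script; run it after the STORE above has finished.
-- Loading the directory picks up the part-*.gz files inside it.
Check = LOAD '/tmp/usercount.gz' USING PigStorage(',') AS (user:chararray, hits:long);

-- Print the decompressed rows to the console.
DUMP Check;
```

If the rows print as plain user,count pairs, the part files were written in a format Pig can decompress on load.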
According to the Pig documentation for PigStorage, there are two ways to do this:
Specifying the compression format using the 'STORE' statement
STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');
STORE UserCount INTO '/tmp/usercount.bz2' USING PigStorage(',');
STORE UserCount INTO '/tmp/usercount.lzo' USING PigStorage(',');
In the statements above, the extension on the output path selects the compression format. Pig supports three formats: GZip, BZip2, and LZO. LZO has to be installed separately before it will work. See here for more information about LZO.
Specifying compression via job properties
Set the properties output.compression.enabled and output.compression.codec in your Pig script. First enable compression:
set output.compression.enabled true;
Then pick one of the following codecs:
set output.compression.codec com.hadoop.compression.lzo.LzopCodec;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;
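Putting the job-properties option together with the script from the question, a minimal sketch might look like the following. It assumes a Pig version that honors these two properties, which I have not verified for Pig 0.6, and it picks the gzip codec purely as an example.

```pig
-- Request compressed job output and choose exactly one codec.
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

-- Same pipeline as in the question.
MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);
PerUser = GROUP MyData BY user;
UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;

-- The output path carries no compression extension here;
-- the two properties set above are what request gzip output.
STORE UserCount INTO '/tmp/usercount' USING PigStorage(',');
```

The set statements need to appear before the STORE they are meant to affect.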