
Specifying a compression codec for an INSERT OVERWRITE SELECT in Hive

I have a Hive table like:

 CREATE TABLE beacons
 (
     foo string,
     bar string,
     foonotbar string
 )
 COMMENT "Digest of daily beacons, by day"
 PARTITIONED BY ( day string COMMENT "In YYYY-MM-DD format" );

To populate, I am doing something like:

 SET hive.exec.compress.output=true;
 SET io.seqfile.compression.type=BLOCK;

 INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" ) SELECT
   someFunc(query, "foo") as foo,
   someFunc(query, "bar") as bar,
   otherFunc(query, "foo||bar") as foonotbar
 FROM raw_logs
 WHERE day = "2011-01-26";

This builds the new partition with the individual part files compressed with DEFLATE, but what I'd really like is for them to go through the LZO compression codec instead.

Unfortunately I am not exactly sure how to accomplish that, but I assume it's one of the many runtime settings or perhaps just an additional line in the CREATE TABLE DDL.


Before the INSERT OVERWRITE, prepend the following runtime configuration values:

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
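
As a quick sanity check after the insert finishes, you can list the partition's files straight from the Hive CLI; with LzopCodec the part files should carry a .lzo extension. This is just a sketch, assuming the default warehouse location of /user/hive/warehouse:

-- run from the Hive CLI; assumes the default warehouse path
dfs -ls /user/hive/warehouse/beacons/day=2011-01-26;
-- expect part files ending in .lzo, e.g. 000000_0.lzo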

Also make sure you have the desired compression codec available by checking:

io.compression.codecs
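
From the Hive CLI, issuing SET with just the property name prints its current value; the codec class above should appear in the list (the exact contents depend on your Hadoop configuration):

SET io.compression.codecs;
-- the output should include com.hadoop.compression.lzo.LzopCodec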

Further information about io.seqfile.compression.type can be found here: http://wiki.apache.org/hadoop/Hive/CompressedStorage

I may be mistaken, but it seemed like the BLOCK type compresses larger chunks of data at a higher ratio, versus many smaller, less-compressed records.
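
For reference, the property accepts three values; a sketch of what each means for SequenceFile output (see the wiki page above for details):

-- possible values for io.seqfile.compression.type:
SET io.seqfile.compression.type=NONE;    -- values stored uncompressed
SET io.seqfile.compression.type=RECORD;  -- each record's value compressed individually
SET io.seqfile.compression.type=BLOCK;   -- batches of records compressed together (better ratio)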

