Specifying the compression codec for an INSERT OVERWRITE SELECT in Hive
I have a hive table like
CREATE TABLE beacons
(
  foo string,
  bar string,
  foonotbar string
)
COMMENT "Digest of daily beacons, by day"
PARTITIONED BY ( day string COMMENT "In YYYY-MM-DD format" );
To populate, I am doing something like:
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" )
SELECT
  someFunc(query, "foo") as foo,
  someFunc(query, "bar") as bar,
  otherFunc(query, "foo||bar") as foonotbar
FROM raw_logs
WHERE day = "2011-01-26";
This builds a new partition whose output files are compressed with the default deflate codec, but the ideal here would be to go through the LZO compression codec instead.
Unfortunately I am not exactly sure how to accomplish that, but I assume it's one of the many runtime settings, or perhaps just an additional line in the CREATE TABLE DDL.
Before the INSERT OVERWRITE, set the following runtime configuration values:
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
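Putting it together with the query from the question, a minimal sketch of the whole session would look like this (someFunc/otherFunc are the UDFs from the question, and the LZO libraries are assumed to be installed on the cluster):
-- Session-level settings: compress job output and route it through LZO
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

-- Same INSERT as before; the new partition's files are now written
-- through LzopCodec instead of the default deflate
INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" )
SELECT
  someFunc(query, "foo") as foo,
  someFunc(query, "bar") as bar,
  otherFunc(query, "foo||bar") as foonotbar
FROM raw_logs
WHERE day = "2011-01-26";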
Also make sure you have the desired compression codec available by checking:
io.compression.codecs
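In the Hive CLI, SET with just a property name prints its current value, so a quick check looks like this (the codec list in the comment is only illustrative; yours depends on your cluster configuration):
-- Print the configured codec list; com.hadoop.compression.lzo.LzopCodec
-- must appear in it for the codec setting above to work
SET io.compression.codecs;
-- e.g. io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,
--      org.apache.hadoop.io.compress.DefaultCodec,
--      com.hadoop.compression.lzo.LzoCodec,
--      com.hadoop.compression.lzo.LzopCodec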
Further information about io.seqfile.compression.type can be found here: http://wiki.apache.org/hadoop/Hive/CompressedStorage
I may be mistaken, but it seemed like BLOCK type compresses batches of records together, yielding larger files at a higher compression ratio, versus RECORD type, which compresses each record individually at a lower ratio.
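To confirm the codec actually took effect, you can list the partition's files straight from the Hive CLI; with LzopCodec they should carry a .lzo extension instead of .deflate. The warehouse path below is an assumption, so adjust it to your hive.metastore.warehouse.dir:
-- Hypothetical path; depends on your warehouse directory setting
dfs -ls /user/hive/warehouse/beacons/day=2011-01-26;
-- Expect files like 000000_0.lzo rather than 000000_0.deflate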