How should I partition data in s3 for use with hadoop hive?

2023-01-30 14:21 问答作者：

I have a s3 bucket containing about 300gb of log files in no particular order.

I want to partition this data for use in hadoop-hive using a date-time stamp so that log-lines related to a particular day are clumped together in the same s3 'folder'. For example log entries for January 1st would be in files matching the following naming:

s3://bucket1/partitions/created_date=2010-01-01/file1
s3://bucket1/partitions/created_date=2010-01-01/file2
s3://bucket1/partitions/created_date=2010-01-01/file3

etc

What would be the best way for me to transform the data? Am I best just running a single script that reads in each file at a time and outputs data to the right s3 location?

I'm sure there's a good way to do this using hadoop, could someone tell me what that is?

What I've tried:

I tried using hadoop-streaming by passing开发者_StackOverflow中文版 in a mapper that collected all log entries for each date then wrote those directly to S3, returning nothing for the reducer, but that seemed to create duplicates. (using the above example, I ended up with 2.5 million entries for Jan 1st instead of 1.4million)

Does anyone have any ideas how best to approach this?

If Hadoop has free slots in the task tracker, it will run multiple copies of the same task. If your output format doesn't properly ignore the resulting duplicate output keys and values (which is possibly the case for S3; I've never used it), you should turn off speculative execution. If your job is map-only, set mapred.map.tasks.speculative.execution to false. If you have a reducer, set mapred.reduce.tasks.speculative.execution to false. Check out Hadoop: The Definitive Guide for more information.

Why not create an external table over this data, then use hive to create the new table?

create table partitioned (some_field string, timestamp string, created_date date) partition(created_date);
insert overwrite partitioned partition(created_date) as select some_field, timestamp, date(timestamp) from orig_external_table;

In fact, I haven't looked up the syntax, so you may need to correct it with reference to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries.

继续阅读：hive mapreduce

How should I partition data in s3 for use with hadoop hive?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？