Parsing and loading into Hive/Hadoop
I am new to the Hadoop MapReduce framework, and I am thinking of using MapReduce to parse my data. I have thousands of big delimited files, and I am thinking of writing a MapReduce job to parse those files and load them into a Hive data warehouse. I have written a parser in Perl that can parse those files, but I am stuck on doing the same with Hadoop MapReduce.
For example, I have a file like: x=a y=b z=c..... x=p y=q z=s..... x=1 z=2 .... and so on
Now I have to load this file as columns (x, y, z) into a Hive table, but I am not able to figure out how to proceed. Any guidance would be really helpful.
Another problem is that in some files the field y is missing, and I have to handle that condition in the MapReduce job. So far, I have tried using streaming.jar with my parser.pl as the mapper. I think that is not the way to do it :), but I was just trying to see if it would work. I also thought of using Hive's LOAD function, but the missing column will create a problem if I specify a RegexSerDe on the Hive table.
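For context, here is a rough sketch of the RegexSerDe approach I mean (the table name and regex are illustrative, not my actual setup); because the pattern is fixed, lines that lack the y field do not match and, with the built-in RegexSerDe, come back as all-NULL rows:

-- Illustrative only: a rigid regex breaks on rows where y is absent.
CREATE TABLE parsed_data (x STRING, y STRING, z STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "x=(\\S+) y=(\\S+) z=(\\S+).*"
);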
I am lost now; if anyone could guide me with this, I would be thankful :)
Regards, Atul
I posted something about this to my blog a while ago. (Google "hive parse_url"; it should be in the top few results.)
I was parsing URLs, but in this case you will want to use str_to_map.
str_to_map(arg1, arg2, arg3)
arg1 => string to process
arg2 => key-value pair separator
arg3 => key-value separator
str = "a=1 b=42 x=abc"
str_to_map(str, " ", "=")
The result of str_to_map will give you a map<string, string> of 3 key-value pairs.
str_to_map(str, " ", "=")["a"]  -- will return "1"
str_to_map(str, " ", "=")["b"]  -- will return "42"
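This also takes care of your missing-y problem: indexing a Hive map with a key that is not present returns NULL rather than failing, so rows without y simply produce NULL for that column:

str_to_map(str, " ", "=")["y"]  -- returns NULL, since y is not in the example string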
We can pass this to Hive via:
INSERT OVERWRITE TABLE new_table_with_cols_x_y_z
SELECT params["x"], params["y"], params["z"]
FROM (
  SELECT str_to_map(raw_line, " ", "=") AS params FROM data
) raw_line_from_data;
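This assumes the raw files have already been landed in a one-column staging table (table, column, and path names here are illustrative). Because Hive's default field delimiter is \001 rather than a space, each whole line ends up in the single raw_line column, something like:

-- Hypothetical staging table: one STRING column holding each raw line.
CREATE TABLE data (raw_line STRING);

-- Move the delimited files from HDFS into the staging table (placeholder path).
LOAD DATA INPATH '/path/to/raw_files' INTO TABLE data;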