Parsing and loading into Hive/Hadoop
I am new to the Hadoop MapReduce framework, and I am thinking of using MapReduce to parse my data. I have thousands of big delimited files, and I am thinking of writing a MapReduce job to parse those files and load them into a Hive data warehouse. I have written a parser in Perl that can parse those files, but I am stuck on doing the same with Hadoop MapReduce.
For example, I have a file like: x=a y=b z=c..... x=p y=q z=s..... x=1 z=2 .... and so on
Now I have to load this file as columns (x, y, z) into a Hive table, but I am not able to figure out how to proceed. Any guidance would be really helpful.
Another problem is that in some files the field y is missing, and I have to handle that condition in the MapReduce job. So far, I have tried using streaming.jar with my parser.pl as the mapper. I think that is not the way to do it :), but I was just trying to see if it would work. I also thought of using Hive's LOAD function, but the missing column will create a problem if I specify a RegexSerDe on the Hive table.
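For context, here is a rough sketch of the RegexSerDe approach I mean (the table name and regex are illustrative, not my actual setup); because the pattern is fixed, lines that lack the y field do not match and, with the built-in RegexSerDe, come back as all-NULL rows:

-- Illustrative only: a rigid regex breaks on rows where y is absent.
CREATE TABLE parsed_data (x STRING, y STRING, z STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "x=(\\S+) y=(\\S+) z=(\\S+).*"
);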
I am lost now; if anyone could guide me with this, I would be thankful :)
Regards, Atul
I posted something about this to my blog a while ago. (Google "hive parse_url"; it should be in the top few results.)
I was parsing URLs, but in this case you will want to use str_to_map.
str_to_map(arg1, arg2, arg3)
arg1 => string to process
arg2 => key-value pair separator
arg3 => key-value separator
str = "a=1 b=42 x=abc"
str_to_map(str, " ", "=")
The result of str_to_map will give you a map<string, string> of 3 key-value pairs.
str_to_map(str, " ", "=")["a"]  -- will return "1"
str_to_map(str, " ", "=")["b"]  -- will return "42"
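This also takes care of your missing-y problem: indexing a Hive map with a key that is not present returns NULL rather than failing, so rows without y simply produce NULL for that column:

str_to_map(str, " ", "=")["y"]  -- returns NULL, since y is not in the example string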
We can pass this to Hive via:
INSERT OVERWRITE TABLE new_table_with_cols_x_y_z
SELECT params["x"], params["y"], params["z"]
FROM (
  SELECT str_to_map(raw_line, " ", "=") AS params FROM data
) raw_line_from_data;
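This assumes the raw files have already been landed in a one-column staging table (table, column, and path names here are illustrative). Because Hive's default field delimiter is \001 rather than a space, each whole line ends up in the single raw_line column, something like:

-- Hypothetical staging table: one STRING column holding each raw line.
CREATE TABLE data (raw_line STRING);

-- Move the delimited files from HDFS into the staging table (placeholder path).
LOAD DATA INPATH '/path/to/raw_files' INTO TABLE data;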