Pig Load problem with multiple delimiters
I have some log lines like these:
Sep 10 12:00:01 10.100.2.28 t: |US,en,5,7350,100,0.076241,0.105342,-1,0,1,5,2,14,,,0,5134,7f378ecef7,fec81ebe-468a-4ac7-b472-8bd1ee88bfc2
Sep 10 12:00:01 10.100.2.28 t: |US,en,3,22427,100,0.05816,0.04018,-1,0,1,15,15,0,24383,cyclops.untd.com/,0,2796,2c5de71073,4858b748-121a-4f60-8087-97a8527d57c6
Sep 10 12:00:01 10.100.2.28 t: |us,en,6,16839,100,-1,-1,-1,17,1,0,-1,0,13819,d.tradex.openx.com/,0,-1,,4f805e3b-86b7-4dee-ae68-24e726cde954
Now, as is evident, there are two delimiters (comma and space). With the PigStorage function, I think I can only use one of them, which leaves me with a chararray that still contains the other delimiter (space or comma).
I want to access each member of that chararray but cannot do so. I have also tried TOKENIZE, but that gives a bag, and I don't think items in a bag are ordered, so they cannot be accessed individually.
Monks any help would be greatly appreciated ...
Tanuj
You can write your own custom user-defined load function that can handle the loading in any way you want. Usually, if your format is some sort of weird custom format, you are going to be stuck doing this. You can also get the nice feature of having your custom loader automatically name the columns.
Your other option would be to preprocess your data before it gets into Pig to be nicely delimited. I'm not sure how your data is set up or how it is coming in, so I'm not sure if this is possible. In general, a little data grooming and sanitization is never a bad thing.
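As a rough illustration of the preprocessing route, here is a minimal Python sketch that rewrites each line with a single delimiter before Pig sees it. The line layout in the regex is only inferred from the three sample lines in the question, and the names `LINE_RE` and `normalize` are made up for this example.

```python
import re

# Assumed layout, inferred from the sample lines only:
#   "<Mon> <day> <HH:MM:SS> <host> <tag>: |<comma-separated fields>"
LINE_RE = re.compile(r"^(\S+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+\|(.*)$")


def normalize(line, out_delim="\t"):
    """Re-join one log line with a single delimiter.

    Returns None when the line does not match the assumed layout.
    """
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    month, day, time, host, tag, payload = m.groups()
    # Keep the timestamp as one field; drop the trailing ':' of the tag.
    fields = [f"{month} {day} {time}", host, tag.rstrip(":")]
    # Empty comma-separated fields are preserved as empty strings.
    fields += payload.split(",")
    return out_delim.join(fields)
```

Running every input line through `normalize` and writing out the non-None results gives a tab-delimited file. Tab is PigStorage's default delimiter, so the output could then be loaded with a plain `PigStorage()` and a full schema.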
The simplest solution I can think of would be to use the built-in PigStorage loader for one of the two delimiters, then STRSPLIT to handle the other one.
Example (assuming there's 19 comma separated fields since that's what it looked like):
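A = LOAD 'myData' USING PigStorage(' ') AS
    (month:chararray, day:chararray, time:chararray, host:chararray, tag:chararray,
     restOfCommaDelimitedFields:chararray);
-- STRSPLIT takes (string, regex, limit), so the ',' regex argument is required.
-- The first split field (country) still carries the leading '|'.
B = FOREACH A GENERATE month, day, time, FLATTEN(STRSPLIT(restOfCommaDelimitedFields, ',', 19)) AS
    (country,language,field3,field4...etc);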
Note this would break if there were spaces between any of your comma delimited fields.
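To make that caveat concrete, here is a small plain-Python imitation of the two-stage split (spaces first, then commas). It is not Pig itself. The function name `two_stage_split` and the `n_prefix=5` default (the five space-separated tokens before the comma payload in the sample lines) are assumptions for illustration.

```python
def two_stage_split(line, n_prefix=5):
    """Split on spaces, then split the token after the prefix on commas.

    Roughly mimics loading with a space delimiter and a fixed-width schema:
    any tokens beyond the payload column are simply ignored.
    """
    tokens = line.split(" ")
    prefix, payload = tokens[:n_prefix], tokens[n_prefix]
    return prefix, payload.split(",")


# A well-formed line keeps all comma-separated fields together.
_, ok_fields = two_stage_split("Sep 10 12:00:01 10.100.2.28 t: |US,en,5")

# A space inside a field cuts the payload short at that space.
_, broken_fields = two_stage_split("Sep 10 12:00:01 10.100.2.28 t: |US,New York,5")
```

In the second call the payload token ends at "New", so everything after the embedded space is lost to the comma split.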
Write your own UDF; it will be the best way to solve your problem.