Using PIG with Hadoop, how do I regex match parts of text with an unknown number of groups?
I'm using Amazon's Elastic MapReduce.
I have log files that look something like this:
random text foo="1" more random text foo="2"
more text notamatch="5" noise foo="1"
blah blah blah foo="1" blah blah foo="3" blah blah foo="4" ...
How can I write a Pig expression to pick out all the numbers in the 'foo' expressions?
I'd prefer tuples that look something like this:
(1,2)
(1)
(1,3,4)
I've tried the following:
TUPLES = foreach LINES generate FLATTEN(EXTRACT(line,'foo="([0-9]+)"'));
But this yields only the first match in each line:
(1)
(1)
(1)
You could use STRSPLIT: http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#STRSPLIT
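The regex to split on would be [^0-9]+ (i.e., one or more non-digit characters).
This splits on the long runs of non-digits, leaving only the tokens made of digits.
Note that it keeps every run of digits on the line, including the 5 in notamatch="5", so it only works when the foo values are the only numbers present.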
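To see what splitting on non-digits would produce for the sample lines, here is a minimal sketch in plain Python rather than Pig. Python's re.split stands in for STRSPLIT, and the helper name digit_tokens is made up for illustration. Empty tokens at the ends are dropped, which may differ slightly from how STRSPLIT treats them.

```python
import re

def digit_tokens(line):
    """Split on runs of non-digits and drop any empty tokens."""
    return [tok for tok in re.split(r'[^0-9]+', line) if tok]

lines = [
    'random text foo="1" more random text foo="2"',
    'more text notamatch="5" noise foo="1"',
]

for line in lines:
    # Every run of digits survives, whether or not it belonged to foo="..."
    print(digit_tokens(line))
```

The second sample line shows the limitation: the 5 from notamatch="5" comes through alongside the foo value.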
Another option would be to write a Pig UDF.
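The core of such a UDF could look roughly like the sketch below. It is plain Python showing only the matching logic. The function name extract_foo is hypothetical, and registering it with Pig (in Java, or through Jython where the Pig version supports it) and declaring an output schema are left out.

```python
import re

# Same pattern as in the question; the capture group is the number inside the quotes.
FOO_PATTERN = re.compile(r'foo="([0-9]+)"')

def extract_foo(line):
    """Return a tuple with every number captured from foo="N" on the line."""
    return tuple(FOO_PATTERN.findall(line))

samples = [
    'random text foo="1" more random text foo="2"',
    'more text notamatch="5" noise foo="1"',
    'blah blah blah foo="1" blah blah foo="3" blah blah foo="4" ...',
]

for line in samples:
    print(extract_foo(line))
```

Unlike the split approach, this only picks up numbers that follow foo=, so the notamatch value is ignored and each line yields one variable-length tuple.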
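The REGEX_EXTRACT function may help you get the desired output. Its third argument selects which capture group to return, so with a single group in the pattern it should be 1:
REGEX_EXTRACT(input, 'foo="([0-9]+)"', 1) AS input;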