Is it possible to detect and handle string collisions among grouped values when grouping in Hadoop Pig?
Assuming I have lines of data like the following that show user names and their favorite fruits:
Alice\tApple
Bob\tApple
Charlie\tGuava
Alice\tOrange
I'd like to create a pig query that shows the favorite fruit of each user. If a user appears multiple times, then I'd like to show "Multiple". F开发者_如何学Pythonor example, the result with the data above should be:
Alice\tMultiple
Bob\tApple
Charlie\tGuava
In SQL, this could be done something like this (although it wouldn't necessarily perform very well):
select user, case when count(fruit) > 1 then 'Multiple' else max(fruit) end
from FruitPreferences
group by user
But I can't figure out the equivalent PigLatin. Any ideas?
Write a "Aggregate Function" Pig UDF (scroll down to "Aggregate Functions"). This is a user-defined function that takes a bag and outputs a scalar. So basically, your UDF would take in the bag, determine if there is more than one item in it, and transform it accordingly with an if statement.
I can think of a way of doing this without a UDF, but it is definitely awkward. After your GROUP
, use SPLIT
to split your data set into two: one in which the count is 1 and one in which the count is more than one:
SPLIT grouped INTO one IF COUNT(fruit) == 0, more IF COUNT(fruit) > 0;
Then, separately use FOREACH ... GENERATE
on each to transform it:
one = FOREACH one GENERATE name, MAX(fruit); -- hack using MAX to get the item
more = FOREACH more GENERATE name, 'Multiple';
Finally, union them back:
out = UNION one, more;
I haven't really found a better way of handing the same data set in two different ways based on some conditional, like you want. I typically do some sort of split/recombine like I did here. I believe Pig will be smart and make a plan that doesn't use more than 1 M/R job.
Disclaimer: I can't actually test this code at the moment, so it may have some mistakes.
Update:
In looking harder, I was reminded of the bicond operator and I think that will work here.
b = FOREACH a GENERATE name, (COUNT(fruit)==1 ? MAX(FRUIT) : 'Multiple');
精彩评论