开发者

Is it possible to detect and handle string collisions among grouped values when grouping in Hadoop Pig?

Assuming I have lines of data like the following that show user names and their favorite fruits:

Alice\tApple
Bob\tApple
Charlie\tGuava
Alice\tOrange

I'd like to create a pig query that shows the favorite fruit of each user. If a user appears multiple times, then I'd like to show "Multiple". F开发者_如何学Pythonor example, the result with the data above should be:

Alice\tMultiple
Bob\tApple
Charlie\tGuava

In SQL, this could be done something like this (although it wouldn't necessarily perform very well):

select user, case when count(fruit) > 1 then 'Multiple' else max(fruit) end
from FruitPreferences
group by user

But I can't figure out the equivalent PigLatin. Any ideas?


Write a "Aggregate Function" Pig UDF (scroll down to "Aggregate Functions"). This is a user-defined function that takes a bag and outputs a scalar. So basically, your UDF would take in the bag, determine if there is more than one item in it, and transform it accordingly with an if statement.

I can think of a way of doing this without a UDF, but it is definitely awkward. After your GROUP, use SPLIT to split your data set into two: one in which the count is 1 and one in which the count is more than one:

SPLIT grouped INTO one IF COUNT(fruit) == 0, more IF COUNT(fruit) > 0;

Then, separately use FOREACH ... GENERATE on each to transform it:

one = FOREACH one GENERATE name, MAX(fruit); -- hack using MAX to get the item
more = FOREACH more GENERATE name, 'Multiple';

Finally, union them back:

out = UNION one, more;

I haven't really found a better way of handing the same data set in two different ways based on some conditional, like you want. I typically do some sort of split/recombine like I did here. I believe Pig will be smart and make a plan that doesn't use more than 1 M/R job.

Disclaimer: I can't actually test this code at the moment, so it may have some mistakes.


Update:

In looking harder, I was reminded of the bicond operator and I think that will work here.

b = FOREACH a GENERATE name, (COUNT(fruit)==1 ? MAX(FRUIT) : 'Multiple');
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜