Can I generate nested bags using nested FOREACH statements in Pig Latin?
Let's say I have a data set of restaurant reviews:
User,City,Restaurant,Rating
Jim,New York,Mecurials,3
Jim,New York,Whapme,4.5
Jim,London,Pint Size,2
Lisa,London,Pint Size,4
Lisa,London,Rabbit Whole,3.5
And I want to produce a list by user and city of average review. I.e. output:
User,City,AverageRating
Jim,New York,3.75
Jim,London,2
Lisa,London,3.75
I could write a Pig script as follows:
Data = LOAD 'data.txt' USING PigStorage(',') AS (
user:chararray, city:chararray, restaurant:chararray, rating:float
);
PerUserCity = GROUP Data BY (user, city);
ResultSet = FOREACH PerUserCity {
GENERATE group.user, group.city, AVG(Data.rating);
}
However, I'm curious whether I can first group at the higher level (the users) and then subgroup at the next level (the cities) later, i.e.:
PerUser = GROUP Data BY user;
Intermediate = FOREACH PerUser {
B = GROUP Data BY city;
GENERATE group AS user, B;
}
I get:
Error during parsing.
Invalid alias: GROUP in {
group: chararray,
Data: {
user: chararray,
city: chararray,
restaurant: chararray,
rating: float
}
}
Has anyone tried this with success? Is it simply not possible to GROUP within a FOREACH?
My goal is to do something like:
ResultSet = FOREACH PerUser {
FOREACH City {
GENERATE user, city, AVG(City.rating)
}
}
Currently the operations allowed inside a FOREACH are DISTINCT, FILTER, LIMIT, and ORDER BY.
For now, grouping directly by (user, city), as you did, is the right way to do it.
Release notes for Pig version 0.10 suggest that nested FOREACH operations are now supported.
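As a rough sketch of what that looks like, assuming Pig 0.10 or later: a FOREACH ... GENERATE can appear inside the nested block to project fields from the inner bag. As far as I know, GROUP is still not among the operations allowed there, so this does not make the Intermediate script from the question work. The aliases Cities, UniqueCities and CitiesPerUser are made up for illustration.

```pig
-- Load the reviews with the schema from the question.
Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float
);
PerUser = GROUP Data BY user;
CitiesPerUser = FOREACH PerUser {
    -- Nested FOREACH: project only the city field from each user's bag.
    Cities = FOREACH Data GENERATE city;
    -- DISTINCT is one of the operations allowed in the nested block.
    UniqueCities = DISTINCT Cities;
    GENERATE group AS user, COUNT(UniqueCities) AS city_count;
};
DUMP CitiesPerUser;
```

This counts the distinct cities each user has reviewed in. If data.txt still contains the header line, it will show up as a user named User unless it is filtered out first.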
Try this:
Records = load 'data_rating.txt' using PigStorage(',') as (user:chararray, city:chararray, restaurant:chararray, rating:float);
grpRecs = group Records By (user,city);
avgRating_Byuser_perCity = foreach grpRecs generate group, AVG(Records.rating) as average;
Result = foreach avgRating_Byuser_perCity generate flatten(group), average;
rawdata = load 'data' using PigStorage(',') as (user:chararray , city:chararray , restaurant:chararray , rating:float);
data = filter rawdata by user != 'User';
groupbyusercity = group data by (user,city);
--describe groupbyusercity;
--groupbyusercity: {group: (user: chararray,city: chararray),data: {(user: chararray,city: chararray,restaurant: chararray,rating: float)}}
average = foreach groupbyusercity {
generate group.user,group.city,AVG(data.rating);
}
dump average;
Grouping by two keys and then flattening the structure leads to the same result:
Loading Data like you did
Data = LOAD 'data.txt' USING PigStorage(',') AS (
user:chararray, city:chararray, restaurant:chararray, rating:float);
Group by user and city
ByUserByCity = GROUP Data BY (user, city);
Add the rating average for each group (you can add more, like COUNT(Data) AS count_res), then flatten the group structure back to the original fields.
ByUserByCityAvg = FOREACH ByUserByCity GENERATE
FLATTEN(group) AS (user, city),
AVG(Data.rating) as user_city_avg;
Results in:
Jim,London,2.0
Jim,New York,3.75
Lisa,London,3.75
User,City,
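If the goal really is a nested bag per user, as the question title asks, one possible approach is to group the flattened averages a second time, by user only. This is a sketch rather than a tested script; the aliases ByUser and Nested and the field name city_avgs are made up for illustration.

```pig
-- Same load and first-level grouping as in this answer.
Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float
);
ByUserByCity = GROUP Data BY (user, city);
ByUserByCityAvg = FOREACH ByUserByCity GENERATE
    FLATTEN(group) AS (user, city),
    AVG(Data.rating) AS user_city_avg;
-- Second-level grouping: one row per user, holding a bag of that user's rows.
ByUser = GROUP ByUserByCityAvg BY user;
-- Keep only (city, average) inside the bag, since user is already the key.
Nested = FOREACH ByUser GENERATE
    group AS user,
    ByUserByCityAvg.(city, user_city_avg) AS city_avgs;
DUMP Nested;
```

Each output row should then hold a user plus a bag of (city, average) tuples. The order of tuples inside the bag is not guaranteed. The header row still comes through with an empty average unless it is filtered out before grouping.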