Implementing cross join in Hadoop
I am trying to implement a cross join using Hadoop in Java. Both sides of the join are too large to keep either of them in memory. I have tried several things, and although I realize that Pig/Hive might be easier, I would like to implement it in native Java.
I think CompositeInputFormat might be the way to do this, but I haven't been able to find any sample code.
I have tried to send tagged data to SequenceFileInputFormat and tried to use the Reducer to join the data, but it didn't work either. (I can provide more details, if this is the right way.)
Is there some sample code that I can have a look at?
CompositeInputFormat requires both sets of data to be sorted and partitioned by the join key.
What you probably want is what you already tried, which is called a reduce-side join. Google it for more info, or check out the discussion in the Hadoop book. You tag each value with the data set it came from and use the join-by/foreign key as the key. In the reducer, the records from the two sets that share a key then arrive together, and you can do whatever sort of joining behavior you want.
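To make the tagging idea concrete, here is a minimal sketch in plain Java that only simulates the map, shuffle, and reduce steps in memory. It does not use any Hadoop classes. The class name ReduceSideJoinSketch, the tags "A" and "B", and the tab-separated record layout with the join key in the first field are all assumptions made for illustration. In a real job, the body of mapRecord would go in your Mapper and the body of reduce in your Reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join; all names here are hypothetical.
class ReduceSideJoinSketch {

    // "Map" step: turn one input line into (joinKey, tag + TAB + restOfRecord).
    // Assumes tab-separated records whose first field is the join key.
    static String[] mapRecord(String tag, String line) {
        String[] parts = line.split("\t", 2);
        String rest = parts.length > 1 ? parts[1] : "";
        return new String[] { parts[0], tag + "\t" + rest };
    }

    // "Reduce" step: all tagged values for one key arrive together.
    // Split them by tag, then pair every A record with every B record.
    static List<String> reduce(String key, List<String> taggedValues) {
        List<String> fromA = new ArrayList<>();
        List<String> fromB = new ArrayList<>();
        for (String value : taggedValues) {
            String[] parts = value.split("\t", 2);
            String payload = parts.length > 1 ? parts[1] : "";
            if (parts[0].equals("A")) {
                fromA.add(payload);
            } else {
                fromB.add(payload);
            }
        }
        List<String> out = new ArrayList<>();
        for (String a : fromA) {
            for (String b : fromB) {
                out.add(key + "\t" + a + "\t" + b);
            }
        }
        return out;
    }

    // Stand-in for the shuffle: group the tagged values by key, sorted by key.
    static void shuffle(Map<String, List<String>> grouped, String[] keyValue) {
        List<String> values = grouped.get(keyValue[0]);
        if (values == null) {
            values = new ArrayList<>();
            grouped.put(keyValue[0], values);
        }
        values.add(keyValue[1]);
    }

    // Runs map, shuffle, and reduce over two in-memory inputs.
    static List<String> join(List<String> linesA, List<String> linesB) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : linesA) {
            shuffle(grouped, mapRecord("A", line));
        }
        for (String line : linesB) {
            shuffle(grouped, mapRecord("B", line));
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            out.addAll(reduce(entry.getKey(), entry.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("1\talice", "2\tbob");
        List<String> b = Arrays.asList("1\tx", "1\ty", "3\tz");
        for (String row : join(a, b)) {
            System.out.println(row);
        }
    }
}
```

Note that this sketch buffers every record for a key in memory inside reduce. That is fine when each key has a modest number of records, but it is not a solution for a key whose records do not fit in memory.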
You are right that doing joins like this is simpler in Pig/Hive. Pig example:
A = LOAD ...
B = LOAD ...
JOINED = JOIN A BY $0, B BY $0;