
Implementing a cross join in Hadoop

I am trying to implement a cross join using Hadoop in Java. Both sides of the join are large enough that I can't keep either of them in memory. I have tried several things, and although I realize that Pig/Hive might be easier, I would like to implement it in native Java.

I think CompositeInputFormat might be the way to do this, but I haven't been able to find any sample code.

I have tried sending tagged data through SequenceFileInputFormat and tried to use the Reducer to join the data, but it didn't work either. (I can provide more details if this is the right way.)

Is there some sample code that I can have a look at?


CompositeInputFormat requires both sets of data to be sorted and partitioned by the join key.
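If you do go down that road, the wiring with the older mapred API looks roughly like the sketch below. This is an untested outline, not code from your setup: the paths and the choice of KeyValueTextInputFormat are placeholders I'm assuming, and both inputs must already be sorted and partitioned identically by the join key.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class CompositeJoinConfig {
    // Map-side join: each map task receives a TupleWritable containing the
    // records from both (pre-sorted, identically partitioned) inputs that
    // share the same key.
    public static void configure(JobConf conf) {
        conf.setInputFormat(CompositeInputFormat.class);
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
                "inner", KeyValueTextInputFormat.class,
                new Path("/data/sorted/A"), new Path("/data/sorted/B")));
    }
}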

What you probably want to do is what you tried, which is called a reduce-side join. Google it for more info, or check out the discussion in the Hadoop book. You tag each value with the data set it came from and use the join-by/foreign key as the map output key. In the reducer, the two sets come together and you can do whatever sort of joining behavior you want.
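A minimal sketch of that tagging pattern, using the newer mapreduce API. The class names, paths, and tab-separated record layout are my own assumptions for illustration, not your actual data:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

    // Tags each record from data set A: emits (joinKey, "A" + TAB + rest).
    public static class TagAMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            ctx.write(new Text(parts[0]), new Text("A\t" + parts[1]));
        }
    }

    // Tags each record from data set B: emits (joinKey, "B" + TAB + rest).
    public static class TagBMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            ctx.write(new Text(parts[0]), new Text("B\t" + parts[1]));
        }
    }

    // All values for one join key arrive in the same reduce call; split them
    // back into the two tagged sets and emit their cross product for that key.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> a = new ArrayList<String>();
            List<String> b = new ArrayList<String>();
            for (Text v : values) {
                String[] parts = v.toString().split("\t", 2);
                if ("A".equals(parts[0])) a.add(parts[1]); else b.add(parts[1]);
            }
            for (String left : a) {
                for (String right : b) {
                    ctx.write(key, new Text(left + "\t" + right));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-side join");
        job.setJarByClass(ReduceSideJoin.class);
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, TagAMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, TagBMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that this buffers each key's values in the reducer, which is fine as long as no single key group is huge; if one is, the usual trick is a secondary sort so one side arrives first and only that side needs buffering.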

You are right that doing joins like this is simpler in Pig/Hive. Pig example:

A = LOAD ...
B = LOAD ...
JOINED = JOIN A BY $0, B BY $0;