Implementing cross join in hadoop

2023-03-16 10:41

I am trying to implement a cross join using Hadoop in Java. Both sides of the join are too large for either of them to be kept in memory. I have tried several things, and although I realize that Pig/Hive might be easier, I would like to implement it in native Java.

I think CompositeInputFormat might be the way to do this, but I haven't been able to find any sample code.

I have tried sending tagged data to SequenceFileInputFormat and tried using the Reducer to join the data, but that didn't work either. (I can provide more details if this is the right way.)

Is there some sample code that I can have a look at?

CompositeInputFormat requires both sets of data to be sorted and partitioned by the join key.

What you probably want is what you already tried, which is called a reduce-side join. Search for that term for more information, or check out the discussion in the Hadoop book. You tag each value with the data set it came from and use the join-by/foreign key as the key. In the reducer, the records from the two sets that share a key then arrive together, and you can apply whatever joining behavior you want.
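To make the tagging idea concrete, here is a minimal sketch in plain Java that only simulates the map, shuffle and reduce steps in memory. It does not use the Hadoop API, and the names (ReduceSideJoinSketch, Tagged, the "A"/"B" tags, the {joinKey, payload} record layout) are made up for illustration. In a real job the grouping by key would be done by the framework's shuffle, and the per-key lists would not necessarily fit in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join; no Hadoop classes are used.
class ReduceSideJoinSketch {

    // A value tagged with the data set it came from ("A" or "B").
    static final class Tagged {
        final String tag;
        final String value;
        Tagged(String tag, String value) { this.tag = tag; this.value = value; }
    }

    // "Map" step: key each record by its join key and tag it with its source.
    // Each record is assumed to be a {joinKey, payload} pair.
    static void map(String tag, List<String[]> records, Map<String, List<Tagged>> shuffle) {
        for (String[] r : records) {
            shuffle.computeIfAbsent(r[0], k -> new ArrayList<>()).add(new Tagged(tag, r[1]));
        }
    }

    // "Reduce" step: for one key, split the values by tag, then pair every A with every B.
    static List<String> reduce(String key, List<Tagged> values) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (Tagged t : values) {
            if (t.tag.equals("A")) left.add(t.value); else right.add(t.value);
        }
        List<String> out = new ArrayList<>();
        for (String a : left) {
            for (String b : right) {
                out.add(key + "\t" + a + "\t" + b);
            }
        }
        return out;
    }

    // Runs both "map" passes, groups by key (the simulated shuffle), then reduces each group.
    static List<String> join(List<String[]> a, List<String[]> b) {
        Map<String, List<Tagged>> shuffle = new TreeMap<>();
        map("A", a, shuffle);
        map("B", b, shuffle);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> e : shuffle.entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> a = List.of(new String[]{"1", "a1"}, new String[]{"2", "a2"});
        List<String[]> b = List.of(new String[]{"1", "b1"}, new String[]{"1", "b2"});
        join(a, b).forEach(System.out::println);
    }
}
```

With a real join key this produces an equi-join: only records that share a key are paired. Pairing every record of one set with every record of the other would require all records to share the same key, which sends everything to a single reducer.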

You are right that doing joins like this is simpler in Pig/Hive. Pig example:

A = LOAD ...
B = LOAD ...
JOINED = JOIN A BY $0, B BY $0;
