Hadoop: intervals and JOIN

2022-12-13 04:09

I'm very new to Hadoop and I'm currently trying to join two sources of data where the key is an interval (say [date-begin/date-end]). For example:

input1:

20091001-20091002    A
20091011-20091104    B
20080111-20091103    C
(...)

input2:

20090902-20091003    D
20081015-20091204    E
20040011-20050101    F
(...)

I'd like to find all the records where key1 overlaps key2. Is it possible with Hadoop? Where can I find an example implementation?

Thanks.

A solution was given on Biostar: http://biostar.stackexchange.com/questions/8821

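I think all that's needed is a key class where hashCode() and equals() do what you want them to do. I suspect, though, that you'll run into a problem: overlap is not transitive. A can overlap B (i.e. A.equals(B) == true) and B can overlap C while C doesn't overlap A. Java's equals() contract requires transitivity, and overlapping intervals would also need identical hashCode() values, so if you implement such an equals() method, you'll probably get strange behaviour.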
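To make the transitivity problem concrete, here is a minimal sketch using records A, B and C from input1 in the question. It assumes closed intervals and dates stored as yyyyMMdd integers (which compare in chronological order); the class and method names are made up for illustration.

```java
public class OverlapNotTransitive {
    // Closed-interval overlap test. Dates are yyyyMMdd ints (an assumption),
    // so plain integer comparison follows calendar order.
    public static boolean overlaps(int begin1, int end1, int begin2, int end2) {
        return begin1 <= end2 && begin2 <= end1;
    }

    public static void main(String[] args) {
        // Records A, B and C from input1 in the question.
        int[] a = {20091001, 20091002};
        int[] b = {20091011, 20091104};
        int[] c = {20080111, 20091103};

        System.out.println("A overlaps C: " + overlaps(a[0], a[1], c[0], c[1]));
        System.out.println("C overlaps B: " + overlaps(c[0], c[1], b[0], b[1]));
        System.out.println("A overlaps B: " + overlaps(a[0], a[1], b[0], b[1]));
    }
}
```

A overlaps C and C overlaps B, yet A does not overlap B, so an equals() built on overlap would put A and B in the same group via C while also reporting them as different.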

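Basically, you want to do something like stabbing queries on a Segment Tree (i.e. to find all intervals E overlapping an interval (p1.start, p1.end), perform stabbing queries for p1.start and p1.end). Note that the two stabbing queries only find intervals that contain one of p1's endpoints; an interval lying entirely inside p1 contains neither, so you also need a range check for intervals whose start falls between p1.start and p1.end.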
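The query logic can be sketched without the tree itself. The example below is only an illustration: it uses a linear scan where a real Segment Tree would answer each test in logarithmic time, it assumes closed intervals of yyyyMMdd integers, and the `Interval` class and method names are hypothetical. An interval s overlaps the query q exactly when s contains q's start (a stabbing hit) or s starts inside q.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StabbingOverlap {
    // Hypothetical record type: closed interval [begin, end] of yyyyMMdd ints.
    public static final class Interval {
        public final int begin;
        public final int end;
        public final String label;

        public Interval(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    // Stabbing test: does interval s contain point p?
    public static boolean stabs(Interval s, int p) {
        return s.begin <= p && p <= s.end;
    }

    // Labels of all candidates overlapping q. Linear scan stands in for the tree.
    public static List<String> overlapping(Interval q, List<Interval> candidates) {
        List<String> hits = new ArrayList<>();
        for (Interval s : candidates) {
            boolean containsQueryStart = stabs(s, q.begin);
            boolean startsInsideQuery = q.begin <= s.begin && s.begin <= q.end;
            if (containsQueryStart || startsInsideQuery) {
                hits.add(s.label);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // input2 from the question, values copied verbatim.
        List<Interval> input2 = Arrays.asList(
            new Interval(20090902, 20091003, "D"),
            new Interval(20081015, 20091204, "E"),
            new Interval(20040011, 20050101, "F"));

        // input1 from the question, used as queries.
        List<Interval> input1 = Arrays.asList(
            new Interval(20091001, 20091002, "A"),
            new Interval(20091011, 20091104, "B"),
            new Interval(20080111, 20091103, "C"));

        for (Interval q : input1) {
            System.out.println(q.label + " overlaps " + overlapping(q, input2));
        }
    }
}
```

For record C, both D and E are found only through the "starts inside the query" test, which is the case that stabbing at the endpoints alone would miss.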

That said, no, I don't know a complete answer to your question. But maybe a search for "Segment tree" hadoop will get you started.
