
Joining very large lists [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.

Closed 4 years ago.

Let's put some numbers first: the largest of the lists is about 100M records (expected to grow to 500M). The other lists (5-6 of them) are in the millions but should stay below 100M for the foreseeable future. These are always joined on a single id, never on any other parameter. What's the best algorithm to join such lists?

I was thinking along the lines of distributed computing: use a good hash function (the consistent-hashing kind, where you can add a node without a lot of data movement) and split these lists into several smaller files. Since they are always joined on the common id (which I will be hashing), the problem boils down to joining small files. Maybe use the *nix join command for that.
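The partitioning idea above can be sketched in a few lines of Python. This is a minimal illustration, not a distributed implementation: the bucket count, record shapes, and helper names are all assumptions for the example, and the "files" are just in-memory lists.

```python
import hashlib
from collections import defaultdict

NUM_BUCKETS = 16  # illustrative; in practice, sized so each bucket fits in memory

def bucket_of(record_id, num_buckets=NUM_BUCKETS):
    # Stable hash of the id, so the same id lands in the same bucket
    # no matter which list it came from.
    h = hashlib.md5(str(record_id).encode()).hexdigest()
    return int(h, 16) % num_buckets

def partition(records, num_buckets=NUM_BUCKETS):
    # records: iterable of (id, payload) pairs; each bucket stands in
    # for one of the smaller files.
    buckets = defaultdict(list)
    for rec_id, payload in records:
        buckets[bucket_of(rec_id, num_buckets)].append((rec_id, payload))
    return buckets

def join_buckets(buckets_a, buckets_b, num_buckets=NUM_BUCKETS):
    # Matching ids can only live in matching buckets, so each pairwise
    # join only touches a small slice of the data.
    for b in range(num_buckets):
        lookup = dict(buckets_a.get(b, []))
        for rec_id, payload_b in buckets_b.get(b, []):
            if rec_id in lookup:
                yield rec_id, lookup[rec_id], payload_b

a = partition([(1, "a1"), (2, "a2"), (3, "a3")])
b = partition([(2, "b2"), (3, "b3"), (4, "b4")])
print(sorted(join_buckets(a, b)))  # joined rows for the common ids 2 and 3
```

The key property is that the join work is embarrassingly parallel across buckets, which is what makes the distributed variant attractive.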

A DB (at least MySQL) would join using a merge join (since the join would be on the primary key). Is that going to be more efficient than my approach?
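For reference, a merge join over two id-sorted inputs runs in a single linear pass. A minimal sketch, assuming unique ids on both sides (as with primary keys), with duplicate handling omitted:

```python
def merge_join(left, right):
    # left and right: lists of (id, value) pairs, each sorted by id,
    # e.g. as produced by scanning a primary-key index in order.
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lid, lval = left[i]
        rid, rval = right[j]
        if lid == rid:
            out.append((lid, lval, rval))
            i += 1
            j += 1
        elif lid < rid:
            i += 1  # advance whichever side has the smaller id
        else:
            j += 1
    return out

print(merge_join([(1, "a"), (2, "b"), (4, "d")],
                 [(2, "x"), (3, "y"), (4, "z")]))
```

If both lists are already sorted, this costs O(n + m) comparisons with purely sequential I/O, which is why it is a good fit for primary-key joins.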

I know it's best to test and see, but given the magnitude of these files, that's pretty time consuming. I would like to do some theoretical calculation first and then see how it fares in practice.

Any insights on these or other ideas would be helpful. I don't mind if it takes slightly longer, but I would prefer the best utilization of the resources I have. I don't have a huge budget :)


Use a database. They are designed for performing joins (with the right indexes, of course!)
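As a small-scale illustration of this answer (using SQLite via Python's standard library rather than MySQL, and made-up table names and sizes), a join on an indexed primary key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# INTEGER PRIMARY KEY is indexed automatically, so the join below
# can use the index instead of scanning both tables.
cur.execute("CREATE TABLE big (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("CREATE TABLE small (id INTEGER PRIMARY KEY, val TEXT)")
cur.executemany("INSERT INTO big VALUES (?, ?)",
                [(i, f"b{i}") for i in range(1000)])
cur.executemany("INSERT INTO small VALUES (?, ?)",
                [(i, f"s{i}") for i in range(0, 1000, 10)])
rows = cur.execute(
    "SELECT big.id, big.val, small.val "
    "FROM big JOIN small ON big.id = small.id "
    "ORDER BY big.id"
).fetchall()
print(len(rows))  # one row per id present in both tables
```

The point is that the database picks the join strategy (and uses the index) for you; at the question's scale the same principle applies, only the engine and tuning matter more.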
