
Similarity join using Hadoop

I'm new to Hadoop. I'd like to run some approaches by you that I came up with.

Problem:

2 datasets : A and B.

Both datasets represent songs: some top-level attributes, one or more titles, and one or more performers.

I need to match these datasets, either by equality or with fuzzy algorithms (such as Levenshtein, Jaccard, Jaro-Winkler, etc.), based on titles and performers.

The dataset sizes are A ≈ 20-30M and B ≈ 1-6M records.
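For reference, two of the metrics mentioned can be sketched in plain Python. This is just to make the comparison step concrete; a real job would use a library implementation:

```python
# Sketch of two of the fuzzy metrics mentioned, applied to song titles.
# Pure Python, no external libraries.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def jaccard(a, b):
    """Jaccard similarity over lowercased word tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not (sa or sb):
        return 1.0
    return len(sa & sb) / len(sa | sb)
```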

So here are the approaches that I came up with:

  1. Load dataset B (the smaller one) into HDFS. Run MapReduce against dataset A (the larger one), where:

    map phase: for each record in A, access HDFS and pull records from B for matching;

    reduce phase: write out ID pairs

  2. Load dataset A into a distributed cache (e.g. JBoss Cache) in an optimized form to speed up searching. Run MapReduce against dataset B, where:

    map phase: for each record in B, query the distributed cache for matches;

    reduce phase: write out ID pairs

  3. Use MapReduce to join both datasets, where:

    map phase: gets a record from set A and a record from set B and does the matching;

    reduce phase: same

    (I'm fuzzy about this one. First, the join would be a Cartesian product with trillions of records; second, I'm not sure how Hadoop can parallelize that across the cluster.)

  4. Use Hive (I'm looking at it right now, trying to figure out how to plug in custom functions that will do the string matching).

I'm looking for pointers on which approach would be the best candidate, or maybe there are other approaches that I don't see.
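To make approach 2 concrete, here is a rough Hadoop Streaming-style mapper sketch in Python. Note it inverts the sets: it assumes the *smaller* set B (rather than A) is shipped to every node and held in memory (e.g. via the DistributedCache), which is the classic replicated map-side join. The tab-separated record layout and the first-word blocking key are made-up illustrations:

```python
# Hypothetical Hadoop Streaming mapper: hold the small set B in memory,
# stream the large set A through stdin, emit matching ID pairs.
import sys
from collections import defaultdict

def load_b(path):
    """Index B by a cheap blocking key (lowercased first word of the title)."""
    index = defaultdict(list)
    with open(path) as f:
        for line in f:
            rec_id, title, performer = line.rstrip("\n").split("\t")
            index[title.lower().split()[0]].append((rec_id, title, performer))
    return index

def run(b_index, stream=sys.stdin, out=sys.stdout):
    for line in stream:
        a_id, a_title, a_performer = line.rstrip("\n").split("\t")
        # Only compare against B records sharing the blocking key.
        key = a_title.lower().split()[0]
        for b_id, b_title, b_performer in b_index.get(key, []):
            if a_title.lower() == b_title.lower():  # or a fuzzy test here
                out.write(f"{a_id}\t{b_id}\n")

if __name__ == "__main__":
    run(load_b("b_records.tsv"))  # hypothetical cached file name
```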


You might find this paper and code useful:

Efficient Parallel Set-Similarity Joins Using MapReduce

I've personally implemented it in Cascading with good results. Unfortunately the code is too domain-specific to release.

The point of the above work is to reduce the number of joins to the candidate pairs that are very likely similar; the candidate pairs can then be compared directly (in an MR join) using any cocktail of relevant algorithms. A good side effect is that this join can be performed evenly across the cluster without duplicate comparisons.
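For intuition, the candidate-generation idea can be sketched like this: for a Jaccard threshold t, two token sets can only reach similarity t if they share a token among their rarest len(x) - ceil(t * len(x)) + 1 tokens (the "prefix"). In MapReduce the prefix token becomes the shuffle key, so each reducer only sees likely-similar records. This is a single-process simplification, not the paper's implementation, and it assumes a precomputed global frequency ordering:

```python
# Prefix filtering: only records sharing a rare "prefix" token become
# candidate pairs; everything else is pruned without comparison.
import math
from itertools import combinations
from collections import defaultdict

def prefix_tokens(tokens, freq_order, t):
    """freq_order: tokens sorted rarest-first (globally precomputed)."""
    ordered = sorted(tokens, key=freq_order.index)
    k = len(ordered) - math.ceil(t * len(ordered)) + 1
    return ordered[:k]

def candidate_pairs(records, freq_order, t):
    """records: {id: token_set}; returns deduplicated candidate ID pairs."""
    groups = defaultdict(set)          # "map" phase: key by prefix token
    for rid, toks in records.items():
        for tok in prefix_tokens(toks, freq_order, t):
            groups[tok].add(rid)
    pairs = set()                      # "reduce" phase: pair within group
    for ids in groups.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```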

Ultimately this is an optimization on performing a cross join between two independent sets, or within the same set (the second case is implemented slightly differently from the first).

Disclosure: I'm the author of Cascading.


Have a look at

  1. http://dbs.uni-leipzig.de/en/publication/learning_based_er_with_mr -> Evaluation of the Cartesian product of two (large) input sets

    However, you should really try to avoid doing pair-wise similarity computation (Levenshtein, etc.) on the Cartesian product. Even with large clusters it will take hours to days, even for medium-sized datasets.

  2. http://dbs.uni-leipzig.de/en/publication/lb_for_mr_based_er -> Explains how Blocking/Clustering approaches with pairwise comparison per cluster can be realized while ensuring evenly loaded tasks (single and dual-source)
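As a toy illustration of the blocking idea (not the papers' actual load-balancing scheme), records are grouped by a blocking key and only compared within a block, so the quadratic comparison cost applies per block rather than to the whole input. The key choice here (first characters of the performer) is an illustrative assumption:

```python
# Minimal blocking sketch: group by a cheap key, pair only within groups.
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    performer = record["performer"].lower().strip()
    return performer[:3]  # crude illustrative key: first three characters

def blocked_pairs(records):
    """records: {id: {"performer": ...}}; yields within-block ID pairs."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[blocking_key(rec)].append(rid)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)
```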


You might want to look at these two papers by Jimmy Lin:

  • Brute Force and Indexed Approaches to Pairwise Document Similarity
  • Pairwise Document Similarity in Large Collections with MapReduce

The approach you take will depend on what kind of similarity metric you use, but a Lucene-based approach may work here. You might also want to think of ways to partition the data to reduce the number of comparisons you need to make.

