Looking for the best tool to do large-scale set comparisons [closed]
Closed 8 years ago.
I'm working on a project that requires finding the sets that intersect most with a given set, among a great number of other sets.
That is, I have a large number (~300k) of sets with hundreds of entries each. Given one of the sets, I need to rank the other sets by how much they intersect with it. Additionally, the set entries have properties that can be used as a filter, e.g. for set X, order the other sets by how much they intersect with the subset of X's "green" entries.
I have free rein to architect this solution, and I'm looking for technology recommendations. I was initially thinking a relational DB would be best suited, but I'm not sure how well it would perform doing these comparisons in real time. Somebody recommended Lucene, but I'm not sure how well that would fit the bill.
I suppose it's worth mentioning that new sets will be added regularly and that the sets may grow, but never shrink.
I don't know exactly what you are looking for: method, library, tool?
If you want to process your large datasets really fast with distributed computing, you should check out MapReduce, e.g. using Hadoop on Amazon EC2/S3 services.
Lucene can easily scale to what you need. Solr will probably be easier to set up, and Hadoop is most likely overkill for only a few million data points.
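To make the Lucene suggestion a bit more concrete: Lucene is built around an inverted index, which here would map each entry to the sets that contain it. The following is a rough pure-Python sketch of that idea only, not Lucene's or Solr's API. The function names (build_index, rank_by_overlap), the (name, colour) entry layout, and the toy data are all made up for illustration.

```python
from collections import Counter, defaultdict

def build_index(sets):
    """Map each entry to the ids of the sets that contain it (hypothetical helper)."""
    index = defaultdict(set)
    for set_id, entries in sets.items():
        for entry in entries:
            index[entry].add(set_id)
    return index

def rank_by_overlap(query_id, sets, index, keep=lambda entry: True):
    """Rank the other sets by how many of the query's (filtered) entries they share."""
    counts = Counter()
    for entry in sets[query_id]:
        if not keep(entry):
            continue  # entry excluded by the property filter
        for other_id in index[entry]:
            if other_id != query_id:
                counts[other_id] += 1
    # Highest overlap first; ties broken by id so the order is deterministic
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy data (assumed layout): each entry is a (name, colour) tuple
sets = {
    "X": {("a", "green"), ("b", "red"), ("c", "green")},
    "Y": {("a", "green"), ("b", "red")},
    "Z": {("c", "green"), ("a", "green"), ("d", "red")},
    "W": {("d", "red")},
}
index = build_index(sets)

all_ranked = rank_by_overlap("X", sets, index)
green_ranked = rank_by_overlap("X", sets, index, keep=lambda e: e[1] == "green")
print(all_ranked)    # overlap on all entries of X
print(green_ranked)  # overlap on the "green" entries of X only
```

Only sets that share at least one entry with the query are ever touched, which is why this kind of index tends to cope well with many sets. Since the sets only grow, adding a set or an entry is just another insertion into the index.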
Something you need to think about is what definition of "how intersected" you want to use. If all the sets are the same size, a raw overlap count is enough, but when sizes vary a size-normalized measure such as Jaccard distance might make more sense; Lucene's default scoring is often good too.
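For reference, Jaccard similarity is the size of the intersection divided by the size of the union, and Jaccard distance is one minus that. A small sketch of how it can disagree with a raw overlap count when set sizes differ (the function names and example sets are made up for illustration):

```python
def jaccard_similarity(a, b):
    """|a & b| / |a | b|; two empty sets are treated as identical by convention."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """1 - Jaccard similarity: 0 for identical sets, 1 for disjoint sets."""
    return 1.0 - jaccard_similarity(a, b)

x = {1, 2, 3, 4}
small = {1, 2}              # shares 2 entries with x
large = set(range(1, 101))  # shares all 4 entries with x, but has 96 others

raw_small, raw_large = len(x & small), len(x & large)
sim_small, sim_large = jaccard_similarity(x, small), jaccard_similarity(x, large)
print(raw_small, raw_large)  # raw overlap favours the large set
print(sim_small, sim_large)  # Jaccard similarity favours the small set
```

Which of the two rankings is "right" depends on whether a very large set that happens to contain everything should count as a good match.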
My advice would be: try running the default Solr instance on your local workstation (it's a click-and-run jar type of deal). You'll know pretty quickly whether Solr/Lucene will work for you or whether you'll have to custom-code your own thing via Hadoop etc.