开发者

Compare/Diff Multiple (>Millions) Arrays

I'm not sure if this is possible; but I have millions of "lists" in a MySQL database, and would like develop a system where I take one of the lists; and compare it against all of the other lists in the database and return:

1.) Lists that closely resemble the primary list (some sort of % would be great)

2.) Given a certain items in a list; it would return a list of of items that are included in the majority of all the other lists (ie. autocomplete a list based on popular options).

I wou开发者_如何学JAVAld've intially thought this would've been possible if I could create some sort of 'loose hash' that I can compare lists mathematically, but I haven't been able to find a solution that scales (since this is exponential when tackled head-on).

Any new ideas/solutions would be greatly appreciated. Thanks!


Your basic MD5 is a (somewhat) loose hash, supported by both php and mysql and quite fast in these kind of things. Just get an MD5 of what ever data and compare it to others.

Do it in PHP, store the MD5 of the data in array key an use if isset().


Your part 2) Given a certain items in a list; it would return a list of of items that are included in the majority of all the other lists (ie. autocomplete a list based on popular options).

is not very clear, but I interpret it as: Given few items, find all lists that contain all or most of the items.

This should be easy once you create an index on your list elements, essentially like a hash table. The exact query will depend on your requirement, length of lists (whether that is a factor in defining the specs, etc).


If you're saying there are millions of lists, itnis really not an option to load them all into a php script. You could get the values of the list you are comparing the others to, and then run an SQL query similar to this:

SELECT list_id, COUNT(value) as c FROM lists WHERE value IN (a,b,c) GROUP BY list_id 
ORDER BY c DESC

I'm not sure the sql is correct, but the idea is to select the ids of the lists that have the same members in them and then sort the output by the number of list items that intersect with the original list. The percenage of item correspondence is easily obtained in this case.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜