开发者

An efficient way to find matching items in N lists?

Given a number of lists of items, find the lists with matching items.

The brute force pseudo-code for this problem looks like:

foreach list L
    foreach item I in list L
        foreach list L2 such that L2 != L  
            for each item I2 in L2
                if I == I2
                    return new 3-tuple(L, L2, I) //not important for the algorithm

I can think of a number of different ways of going about this - creating a list of lists and removing each candidate list after searching the others 开发者_高级运维for example - but I'm wondering if there is a better algorithm for this?

I'm using Java, if that makes a difference to your implementation.

Thanks


  1. Create a Map<Item,List<List>>.
  2. Iterate through every item in every list.
  3. each time you touch an item, add the current list to that item's entry in the Map.

You now have a Map entry for each item that tells you what lists that item appears in.

This algorithm is about O(N) where N is the number of lists (the exact complexity will be affected by how good your Map implementation is). I believe your algorithm was at least O(N^2).

Caveat: I am comparing number of comparisons, not memory use. If your lists are super huge and full of mostly non duplicated items, the map that my method creates might become too big.


As per your comment you want a MultiMap implementation. A multimap is like a Map but it can map each key to multiple values. Store the value and a reference to all the maps that contain that value.

Map<Object, List>

of course you should use a type safe instead of Object and a type safe List as the value. What you are trying to do is called an Inverted Index.


I'll start with the assumption that the datasets can fit in memory. If not, then you will need something fancier.

I refer below to a "set", where I am thinking of something like a C++ std::set. I don't know the Java equivalent, but any storage scheme that permits rapid lookup (tree, hash table, whatever).

Comparing three lists: L0, L1 and L2.

  1. Read L0, placing each element in a set: S0.
  2. Read L1, placing items that match an element of S0 into a new set: S1, and discarding others.
  3. Discard S0.
  4. Read L2, keeping items that match an element of S1 and discarding others.

Update Just realised that the question was for "n" lists, not three. However the extension should be obvious. (I hope)

Update 2 Some untested C++ code to illustrate the algorithm

#include <string>
#include <vector>
#include <set>
#include <cassert>

typedef std::vector<std::string> strlist_t;

strlist_t GetMatches(std::vector<strlist_t> vLists)
{
    assert(vLists.size() > 1);
    std::set<std::string> s0, s1;
    std::set<std::string> *pOld = &s1;
    std::set<std::string> *pNew = &s0;

    // unconditionally load first list as "new"
    s0.insert(vLists[0].begin(), vLists[0].end());

    for (size_t i=1; i<vLists.size(); ++i)
    {
        //swap recently read "new" to "old" now for comparison with new list
        std::swap(pOld, pNew);
        pNew->clear();

        // only keep new elements if they are matched in old list
        for (size_t j=0; j<vLists[i].size(); ++j)
        {
            if (pOld->end() != pOld->find(vLists[i][j]))
            {
                // found match
                pNew->insert(vLists[i][j]);
            }
        }
    }
    return strlist_t(pNew->begin(), pNew->end());
}


You can use a trie, modified to record what lists each node belongs to.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜