Data structure for querying whether a given subset exists in a collection of sets

2023-02-15 10:10 问答作者：

I'm trying to build a data structure for a word game solver.

I need to store about 150,000 sets of the form {A, A, D, E, I, L, P, T, V, Y}. (They are normalized English words, i.e. ch开发者_运维知识库aracters sorted. Note this is a multiset which can contain the same letter twice.)

Need to efficiently get a yes/no answer to the following kind of query: are there any sets with the given subset? For example, do any of the known words contain the set {D, E, I, L, L, P}?

Requirements:

Queries must be fast
The data structure should fit in a reasonable amount of space (e.g. <50 MB)
The data structure need not be built in real time; it's pre-computed.

Is there any data structure out there that would suit this need well? This is a little different from other set matching questions on StackOverflow in that the target sets are actually multisets.

This reminds me of a mutated prefix tree/trie that I made once. Slightly different but it might work. It may not work if you have large/no bounds or if you can't convert it to your language (I code in c++).

So basically, in a trie you usually store children corresponding to the next letter but what I did was I stored children corresponding to the frequency of each letter.

What the question basically is (from my point of view) is, "Are there any sets that have the same or more of a letter than in the subset?" For example, if the subset is { A,D,E,E } then you need to find if there is a set with at least one A, one D and two E's.

So, for the trie you have something like this

            Root
           / | \
          / /|\ \
         / / | \ \
        1 2  ... MAX <-- This represents the frequency of "A"
       /|\ ..... /|\
      1..MAX    1..MAX <-- Frequency of "B"
      ...............
      ...............
      ...............
     1 ... ... ... MAX <-- Frequency of "Y"
    /|\ .... .... / | \
   1..MAX ...... 1 .. MAX <-- Frequency of "Z"

Basically all of the ...'s represent lots of stuff that would take too long to show. The /,| and \ represent parent child relationship and MAX represent the maximum frequency of a letter

So what you do, is you have a struct (I code in c++) of some sort like:

struct NODE {
    NODE *child[MAX + 1]; // Pointers to other NODE's that represents
                          // the frequency of the next letter
};

When you create a node you need to initialize all its children to NULL. You can either do this through a constructor (in c++) or a makeNode() function like

NODE* makeNode() {
    NODE* n = new NODE;         // Create a NODE
    for(int i = 0;i <= MAX;i++) // For each child
        n->child[i] = NULL;     // Initialize to NULL
};

At the start, the trie is just a root

NODE* root = new NODE;

When you add a set to the trie, you get the frequency of each letter and go through the trie. If at a particular node, the child corresponding to the next letter is NULL, you just create a new NODE.

When you search the trie, you search all of the children of each node that corresponds to the frequency of the letter in the subset or larger. For example if the subset has 3 A's you search all of the root->child[3] then root->child[4] then ... then root->child[MAX].

It's probably overly complicated and confusing so 1) If you think I'm not mad then please comment on what's confusing and 2) You may/probably want to just find a simpler method

Looks like you could try using KD-Trees or a variant.

A related topic to explore would be multi-dimensional range searching/querying.

Caveat: I haven't used these myself, but I hope you might be able to find something useful by reading some literature on the topic above.

Hope that helps.

You could probably use a trie and insert each set into the trie, traversing the trie iteratively using your target subset to find out if you have a matching subset. At least that's how I think I would do it.

The 'trie' was actually conceived for a reTRIEvable data structure, and is pretty much like a normal tree, but has nodes with different permutations, for example:

     A
    / \
   AT AN
     / | \
    |  |  AND
   ANN ANY
    |
  ANNA

In the above example, you can see that this is probably useful for your case, as ANN and ANNA can be retrieved like a set. You might want to use some permutation code, along with this type of ADT (Abstract Data Type).

Find more here

继续阅读：algorithm data-structures multiset subset

Data structure for querying whether a given subset exists in a collection of sets

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？