Data structure for querying whether a given subset exists in a collection of sets
I'm trying to build a data structure for a word game solver.
I need to store about 150,000 sets of the form {A, A, D, E, I, L, P, T, V, Y}. (They are normalized English words, i.e. ch开发者_运维知识库aracters sorted. Note this is a multiset which can contain the same letter twice.)
Need to efficiently get a yes/no answer to the following kind of query: are there any sets with the given subset? For example, do any of the known words contain the set {D, E, I, L, L, P}?
Requirements:
- Queries must be fast
- The data structure should fit in a reasonable amount of space (e.g. <50 MB)
- The data structure need not be built in real time; it's pre-computed.
Is there any data structure out there that would suit this need well? This is a little different from other set matching questions on StackOverflow in that the target sets are actually multisets.
This reminds me of a mutated prefix tree/trie that I made once. Slightly different but it might work. It may not work if you have large/no bounds or if you can't convert it to your language (I code in c++).
So basically, in a trie you usually store children corresponding to the next letter but what I did was I stored children corresponding to the frequency of each letter.
What the question basically is (from my point of view) is, "Are there any sets that have the same or more of a letter than in the subset?" For example, if the subset is { A,D,E,E } then you need to find if there is a set with at least one A, one D and two E's.
So, for the trie you have something like this
Root
/ | \
/ /|\ \
/ / | \ \
1 2 ... MAX <-- This represents the frequency of "A"
/|\ ..... /|\
1..MAX 1..MAX <-- Frequency of "B"
...............
...............
...............
1 ... ... ... MAX <-- Frequency of "Y"
/|\ .... .... / | \
1..MAX ...... 1 .. MAX <-- Frequency of "Z"
Basically all of the ...'s represent lots of stuff that would take too long to show. The /,| and \ represent parent child relationship and MAX represent the maximum frequency of a letter
So what you do, is you have a struct (I code in c++) of some sort like:
struct NODE {
NODE *child[MAX + 1]; // Pointers to other NODE's that represents
// the frequency of the next letter
};
When you create a node you need to initialize all its children to NULL. You can either do this through a constructor (in c++) or a makeNode() function like
NODE* makeNode() {
NODE* n = new NODE; // Create a NODE
for(int i = 0;i <= MAX;i++) // For each child
n->child[i] = NULL; // Initialize to NULL
};
At the start, the trie is just a root
NODE* root = new NODE;
When you add a set to the trie, you get the frequency of each letter and go through the trie. If at a particular node, the child corresponding to the next letter is NULL, you just create a new NODE.
When you search the trie, you search all of the children of each node that corresponds to the frequency of the letter in the subset or larger. For example if the subset has 3 A's you search all of the root->child[3] then root->child[4] then ... then root->child[MAX].
It's probably overly complicated and confusing so 1) If you think I'm not mad then please comment on what's confusing and 2) You may/probably want to just find a simpler method
Looks like you could try using KD-Trees or a variant.
A related topic to explore would be multi-dimensional range searching/querying.
Caveat: I haven't used these myself, but I hope you might be able to find something useful by reading some literature on the topic above.
Hope that helps.
You could probably use a trie and insert each set into the trie, traversing the trie iteratively using your target subset to find out if you have a matching subset. At least that's how I think I would do it.
The 'trie' was actually conceived for a reTRIEvable data structure, and is pretty much like a normal tree, but has nodes with different permutations, for example:
A
/ \
AT AN
/ | \
| | AND
ANN ANY
|
ANNA
In the above example, you can see that this is probably useful for your case, as ANN and ANNA can be retrieved like a set. You might want to use some permutation code, along with this type of ADT (Abstract Data Type).
Find more here
精彩评论