Data structure for Pattern Matching on large data
Problem Background
I have a finite vocabulary containing of say 10 symbols [A-J]. What these symbols mean is not relevant to the question. They could be DNA bases, phonemes, words etc.
An item is a sequence of symbols. In this problem, all items are of the same length (say 6). E.g.
A C B A D J
I have a large (5M) table that contains counts of all items of length 6 sampled from some known data. E.g.
A C B A D J 55
B C B I C F 923
A B C D E G 478
Given a new sequence with one unknown symbol, my task is to guess the symbol. In the following example, the missing symbol is ?.
B C B ? C F
A simple solution to guess ? is to look into my table and find the item with the largest count that fits the pattern B C B ? C F
Questions
What is a good data structure to store my item-frequency table so that I handle space-time reasonably efficiently? I prefer to use less memory if the computation at query time is reasonable. (I will be having many such tables and so the 5M number is just an approximation.)
What are some implementation details that can make a big difference in processing speed?
Things I have thought of:
Make a string out of every sequence and use regexes to match. Caveats: (1) O(n) scanning is unacceptable. (2) Regexes are slow. (3) Strings (in Java, at least) are bloated.
Let Lucene handle the indexing. Turn off tfidf scoring. Use phrase-search. Potentially use the count values for scoring so that Lucene takes care of the sorting too.
Use prefix and suffix tries to index each item.
Use a DB (likely in-memory) with the data in one column or in separate columns and let it handle the search.
Updates
- In my actual application, I will be working with sequences of lengths 5, 6, 7, 8, 9 and 10, stored separately. I simplified the problem by restricting it to a fixed length. Hence the constraint/preference for a solution that uses less memory.
- My vocabulary size can be assumed to be under 20.
The trie option seems to be the best one: with the number of occurrences stored on the leaves, you can easily design a function that returns all possible strings with one missing character in O(log n) time, and then you just iterate over this small set of strings, looking for the maximum number of occurrences. If you use chars from A to Z, there will be at most 26 such strings, so iterating won't take long.
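To make that concrete, here is a rough sketch of such a trie in Java (class and method names are my own; counts sit on the leaves, and the '?' in the query branches into every child at that depth):

    import java.util.HashMap;
    import java.util.Map;

    class CountTrie {
        private final Map<Character, CountTrie> children = new HashMap<>();
        private int count; // number of occurrences; only meaningful on leaf nodes

        void insert(String item, int occurrences) {
            CountTrie node = this;
            for (char c : item.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new CountTrie());
            }
            node.count = occurrences;
        }

        // Returns the symbol that, substituted for the single '?', gives the most frequent item.
        char bestFill(String pattern) {
            int q = pattern.indexOf('?');
            CountTrie node = this;
            for (int i = 0; i < q && node != null; i++) {        // walk down to the wildcard position
                node = node.children.get(pattern.charAt(i));
            }
            char best = '?';
            int bestCount = -1;
            if (node == null) return best;
            for (Map.Entry<Character, CountTrie> e : node.children.entrySet()) {
                CountTrie n = e.getValue();                      // try this symbol at the '?' position
                for (int i = q + 1; i < pattern.length() && n != null; i++) {
                    n = n.children.get(pattern.charAt(i));       // follow the rest of the pattern
                }
                if (n != null && n.count > bestCount) {
                    bestCount = n.count;
                    best = e.getKey();
                }
            }
            return best;
        }
    }

With the example data, inserting "BCBICF" with count 923 and "ACBADJ" with count 55 makes bestFill("BCB?CF") return 'I'.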
AFAIK, Lucene uses such a mechanism internally for its wildcard searches, so you can concatenate your chars, index the results with KeywordAnalyzer (to omit stemming) and then search for "ACB?DJ". The only restriction here is that Lucene cannot handle searches with a leading "?", but you can work around it by adding one extra char at the beginning (just a trick to get past Lucene's checks) or by keeping one more index of reversed words (which will greatly speed up queries where the wildcard is the first char).
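A minimal sketch of that setup, assuming a recent Lucene release (8.x/9.x); the field names and the in-memory directory are just my choices, not anything the library mandates:

    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.WildcardQuery;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()));

            // Index each item as one untokenized term and store its count alongside.
            Document doc = new Document();
            doc.add(new StringField("seq", "BCBICF", Field.Store.YES));
            doc.add(new StoredField("count", 923));
            writer.addDocument(doc);
            writer.close();

            // The single unknown symbol becomes a single-character wildcard.
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            ScoreDoc[] hits = searcher.search(new WildcardQuery(new Term("seq", "BCB?CF")), 100).scoreDocs;
            for (ScoreDoc hit : hits) {
                Document d = searcher.doc(hit.doc);
                System.out.println(d.get("seq") + " " + d.getField("count").numericValue());
            }
        }
    }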
And finally, since you have to calculate the number of occurrences first anyway, you can use machine learning schemes such as decision trees to handle all the work. There have been cases where decision trees were used to compress a database and speed up search, so you can do the same. Use lines as instances, positions of chars as attributes, and the chars themselves as attribute values. Then run some algorithm like C4.5 (you can use Weka's implementation, called J48) with minimal pruning and run classification - the algorithm will do the rest!
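If you want to try that route, here is a rough sketch using Weka's API. It assumes one tree per missing position (here, position 3 of a 6-symbol item); the attribute and class names are my own, and the item counts become instance weights:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import weka.classifiers.trees.J48;
    import weka.core.Attribute;
    import weka.core.DenseInstance;
    import weka.core.Instance;
    import weka.core.Instances;

    public class J48Guess {
        public static void main(String[] args) throws Exception {
            List<String> vocab = Arrays.asList("A", "B", "C", "D", "E", "F", "G", "H", "I", "J");

            // Five known positions as nominal attributes; the missing symbol is the class.
            ArrayList<Attribute> attrs = new ArrayList<>();
            for (int i = 0; i < 5; i++) attrs.add(new Attribute("pos" + i, vocab));
            attrs.add(new Attribute("missing", vocab));
            Instances data = new Instances("items", attrs, 0);
            data.setClassIndex(attrs.size() - 1);

            addRow(data, "BCBICF", 923);   // one instance per table row, weighted by its count
            addRow(data, "ACBADJ", 55);

            J48 tree = new J48();
            tree.setUnpruned(true);        // "minimal pruning"
            tree.buildClassifier(data);

            // Query B C B ? C F: set the five known symbols, leave the class missing.
            Instance q = new DenseInstance(data.numAttributes());
            q.setDataset(data);
            String[] known = {"B", "C", "B", "C", "F"};
            for (int i = 0; i < known.length; i++) q.setValue(data.attribute(i), known[i]);
            int predicted = (int) tree.classifyInstance(q);
            System.out.println(data.classAttribute().value(predicted));
        }

        // Splits a 6-symbol item into the five known attributes (skipping position 3) plus the class.
        static void addRow(Instances data, String item, double count) {
            Instance inst = new DenseInstance(data.numAttributes());
            inst.setDataset(data);
            int a = 0;
            for (int i = 0; i < item.length(); i++) {
                String s = String.valueOf(item.charAt(i));
                if (i == 3) inst.setValue(data.classAttribute(), s);
                else inst.setValue(data.attribute(a++), s);
            }
            inst.setWeight(count);
            data.add(inst);
        }
    }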
Based on the comment that there will only be one unknown, you can do the following:
Put your data in a hashtable. When you need to look up a pattern, generate all the wildcard combinations; since your vocabulary is limited, this means looking up at most 20 patterns. This sounds like a hack, but if you consider the performance implications of the other methods, it is hard to beat. A hash table lookup is O(1), and 20 lookups are O(1) too.
This method is not advisable if the number of wildcards could increase, although it may still perform well for 2 or 3.
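A sketch of that lookup for the single-wildcard case (the class and method names are mine; the table is assumed to already be loaded into a plain HashMap keyed by the full 6-symbol string):

    import java.util.HashMap;
    import java.util.Map;

    public class WildcardLookup {
        static final String VOCAB = "ABCDEFGHIJ";   // your symbol set; at most ~20 entries

        // Returns the symbol whose substitution has the highest count, or '?' if nothing matches.
        static char guess(Map<String, Integer> counts, String pattern) {
            int q = pattern.indexOf('?');
            char best = '?';
            int bestCount = -1;
            StringBuilder candidate = new StringBuilder(pattern);
            for (char c : VOCAB.toCharArray()) {     // one hash lookup per vocabulary symbol
                candidate.setCharAt(q, c);
                Integer n = counts.get(candidate.toString());
                if (n != null && n > bestCount) {
                    bestCount = n;
                    best = c;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            Map<String, Integer> counts = new HashMap<>();
            counts.put("BCBICF", 923);
            counts.put("ACBADJ", 55);
            System.out.println(guess(counts, "BCB?CF"));   // prints I
        }
    }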
A double-array trie would also work and may reduce the amount of space needed to store your strings, but the performance would suffer.
In order to uniquely characterize a new sequence, two pieces of information are necessary: the sequence (string) of five known symbols, and the position of the unknown symbol. If your alphabet has 10 symbols then there can be no more than 10^5 = 100000 unique five-symbol strings.
Depending on your memory resources, this may be small enough to fit into a hashtable whose entries provide a lookup structure to find the best (position, symbol) combination. For example:
---------
| BCBCF | --> { 0: <heap of symbols partially ordered by frequency>, ... }
---------
This should allow for a fairly efficient lookup for a new sequence: concatenate the known symbols, look up the sequence in the hashtable, find the position of the unknown character, and then return the symbol that is at the top of the corresponding heap.
If you can guarantee the lookup structure will be stable (no new input) before any lookup is performed, then you can squeeze a bit more efficiency out of it by replacing each of the position-indexed heaps with the single symbol that would have been at the top of the heap. (The heap structure is only necessary during the lookup phase if new information will come in that may change the symbol frequencies.)
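Here is a rough sketch of that precomputed variant. I flatten the (known-symbols, position) pair into a single string key instead of nesting maps, and the count map is only needed while the table is being built; all names are mine:

    import java.util.HashMap;
    import java.util.Map;

    public class KnownSymbolIndex {
        // key = the five known symbols + ":" + position of the missing symbol
        private final Map<String, Character> bestSymbol = new HashMap<>();
        private final Map<String, Integer> bestCount = new HashMap<>();   // only needed while building

        // Called once per table row; strips out each position in turn.
        void add(String item, int count) {
            for (int pos = 0; pos < item.length(); pos++) {
                String key = item.substring(0, pos) + item.substring(pos + 1) + ":" + pos;
                Integer prev = bestCount.get(key);
                if (prev == null || count > prev) {
                    bestCount.put(key, count);
                    bestSymbol.put(key, item.charAt(pos));
                }
            }
        }

        // pattern contains exactly one '?', e.g. "BCB?CF"; returns null if nothing matches
        Character lookup(String pattern) {
            int pos = pattern.indexOf('?');
            String key = pattern.substring(0, pos) + pattern.substring(pos + 1) + ":" + pos;
            return bestSymbol.get(key);
        }
    }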
The db would be an easy solution, but another solution is a tree where each node branches on one character and the leaves contain arrays of possible results and counts. Then it would only take 5 steps in the tree to match one string. But creating the tree would take N*C time, where N is the number of items and C is the number of characters in each item. A wildcard is just a step in the tree that consumes one character of the input while keeping all the possible results intact.
I was among the "everyone missing the obvious" here.
Just use any quick key/value lookup that is available to you, and look up all of your possible values. It is a small set and won't take long. Anything else, short of storing your data 6 times over, will be slower.
If you had a large possible vocabulary, then my previous answer would be appropriate.
Here is my old (and bad) answer.
I would stick them in a database with multiple concatenated indexes. How many is up to you.
At a minimum I would have 2: an index on (col1, col2, col3, col4, col5, col6) and one on (col4, col5, col6, col1, col2, col3). This would mean that, no matter which column was missing, there would be a way to get your data and only look through at most 1/1000 of the records. If you wish, you could instead index (col1, col2, col3, col4, col5, col6), (col3, col4, col5, col6, col1, col2) and (col5, col6, col1, col2, col3, col4) to limit your search to 1/10000 of the data. This uses half again as much memory, but is 10 times faster. (Warning: I will not guarantee that MySQL will successfully figure out which index it should use. I'd hope that other databases would get it right, but I haven't tested it.)
If you did not wish to use a database, you could use balanced binary trees in exactly the way I was suggesting you use indexes above. For any given search, pick the tree that has the missing element as deep as possible, do a range search, then filter the returned data for just the rows of interest. This is, in fact, exactly what a good database should do with those indexes.
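An in-memory sketch of that idea using java.util.TreeSet as the balanced trees, with just the two rotations from the two-index scheme above (the rotation choice and all names are mine):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.NavigableSet;
    import java.util.TreeSet;

    public class RotatedIndexes {
        // Index 0 stores items as-is (col1..col6); index 1 stores them rotated to (col4,col5,col6,col1,col2,col3).
        private final TreeSet<String> asIs = new TreeSet<>();
        private final TreeSet<String> rotated = new TreeSet<>();

        void add(String item) {
            asIs.add(item);
            rotated.add(item.substring(3) + item.substring(0, 3));
        }

        // pattern has exactly one '?'; returns the matching items in original column order
        List<String> find(String pattern) {
            int q = pattern.indexOf('?');
            boolean useRotated = q < 3;   // the rotated index gives the longer known prefix
            String p = useRotated ? pattern.substring(3) + pattern.substring(0, 3) : pattern;
            String prefix = p.substring(0, p.indexOf('?'));
            NavigableSet<String> range =
                    (useRotated ? rotated : asIs).subSet(prefix, true, prefix + '\uffff', false);

            List<String> out = new ArrayList<>();
            for (String s : range) {                              // range scan, then filter
                String original = useRotated ? s.substring(3) + s.substring(0, 3) : s;
                if (matches(original, pattern)) out.add(original);
            }
            return out;
        }

        private static boolean matches(String item, String pattern) {
            for (int i = 0; i < pattern.length(); i++) {
                if (pattern.charAt(i) != '?' && pattern.charAt(i) != item.charAt(i)) return false;
            }
            return true;
        }
    }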