How Can I Speed Up This Anagram Algorithm
I am making a mobile app to find anagrams and partial matches. Mobile is important because there is not a whole lot of computational power, and efficiency is key.
The algorithm takes any number of letters, including repeats, and finds the longest words made up from its letters, using every letter only once. I am also interested in finding the top results quickly, and am not really concerned with the bottoms (shorter ones) as long as N is met. For example:
STACK => stack, tacks开发者_StackOverflow中文版, acts, cask, cast, cats…
I have done some googling and have found a few algorithms, and I came up with one which I thought would be efficient, but is not as efficient as I would like.
I have a lookup dictionary pre-made that maps a sorted key to the real words that generate that key.
"aelpp" => ["apple", "appel", "pepla"]
I have further split each dictionary into different ones based on the length of the key. So keys that are 5 letters long are in one dictionary, keys that are 6 are in another. Each of these dictionaries are in an array in which the index is the length of the keys that are found in the dictionary.
anagramArray[5] => dictionary5
dictionary5["aelpp"] => ["apple", "appel", "pepla"]
My algorithm starts by taking an input word "lappe
", and it sorts it:
"lappe" => "aelpp"
Now, for each dictionary that has contents of at most 5 letters, I do a comparison to pull it out. Here is pseudocode:
word = input.sort
for (i = word.length; i > 0; i--)
dictionaryN = array[i]
for (key in dictionaryN)
if word matches key
add to returnArray
end
end
if returnArray count > N
break
end
end
returnArray.sort by longest word, alphabetize
The dictionary only has about 170,000 words in it, but searches are taking up to 20 seconds for 12 letter inputs. My match
method makes a regex out of the key:
"ackst" => /a.*c.*k.*s.*t.*/
such that, for example, a 4 letter key such as acst
(acts), will match ackst
(stack) because:
"ackst" matches /a.*c.*s.*t.*/
I have seen other apps do the same thing in much less time, and I am wondering if my approach is garbage or just needs some tweaking.
How can I get the maximum computational efficiency for generating the top N anagrams from a word, sorted by max length?
If you think of (and perhaps even represent) a dictionary as a tree of letters you can avoid looking at lots of nodes. If "stack" is in the dictionary, then there will be a path from the root to a leaf labelled a-c-k-s-t. If the input word is "attacks" then sort this to get aackstt. You can write a recursive routine to follow links down from the root, consuming letters from aackstt as you go. When you reach a-c-k you will have stt left in your string, so you can follow the s to reach ackst, but you can rule out following u to reach a-c-k-u and its descendants, v to reach a-c-k-v and its descendants, and so on.
In fact, with this scheme, you could use just one tree to hold words of any number of letters, which should save you doing multiple searches, one for each target length.
Generating regular expressions is a bit expensive, and so you probably don't want to be doing that inside a loop.
An option (not necessarily super-efficient, but it seems useful in this particular case) that comes to mind is that instead of searching through all of your words in your dictionary, try removing letters in various combinations and checking if the resultant string is in your dictionary. This will max out at 2^n iterations (where n is the number of letters in your word), which is better than 170k for n < 18. Note that this particular approach doesn't hold up with long inputs, but it should be very fast otherwise.
Build your dictionary as follows:
For each word W in the English language (or whatever word set you have)
Sort the characters in W by alphabetical order (e.g. "apple" -> "aelpp") into a new string called W'
Compute Hash H into W' using any fast hash algorithm (e.g CRC32. You could likely invent anything yourself that has a low number of collisions)
Store W and H as an element in the dictionary array
That is:
Word.original = W;
Word.hash = Hash(W');
Dictionary.append(Word);
Sort the dictionary by hash values.
And now to find all the Anagrams or search word S
Sort the characters in S by alphabetical order (e.g. "apple" -> "aelpp") into a new string called S'
Compute Hash H of S' using the same fast hash algorithm above
Now do a binary search on the dictionary for H. The binary search should return an index F into Dictionary
If the binary search fails to return an index into the Dictionary array, exit and return nothing
I = F
// Scan forward in the dictionary array looking for matches
// a matching hash value is not a guarantee of an anagram match
while (I < Dictionary.size) && (Dictionary[I].hash == H)
if (IsAnagram(Dictonary[I], S)
ResultSet.append(Dictionary[I].original)
// Scan backwards in the dictionary array looking for matches
I = F-1;
while (I >= 0) && (Dictionary[I].hash == H)
if (IsAnagram(Dictonary[I], S)
ResultSet.append(Dictionary[I].original)
return ResultSet
Now I didn't cover how to handle "substring" searches (searching for anagram words that are smaller in length that the search word. I was a bit confused if this was a requirement. Your instructions imply that that resuling set of anagrams should have the exact same set of characters as the search word. But you can likely enumerate through all the substrings of your search string and run each substring through the Search algorithm described above.
This is just an idea, but maybe it's what you're looking for. You have only one structure that you iterate through, and all words of all sizes are in it. With each iteration step you introduce one more letter, and you narrow search only to words that do not have letters "larger" than the ones already introduced. For instance, if you introduce M you can't introduce anything in range N-Z anymore.
The structure could be a binary tree where introduction of one letter leads you several tree levels further. Each node has a letter, and branch to the rest of smaller letters, and the branch to the rest of larger letters, and a branch to the root of next narrowed search, and a pointer to the list of words that are completely built with letters introduced so far. Branches may be null if there are no possible words in that search sub-space, but you can't have at the same time null for 3 branches and null for pointer to the list of words. (Well you can, as a sort of optimization, which is irrelevant right now). Instead of pointer to list of words you can have a flag denoting existence of words with given letters, but those words can be stored in some other dictionary.
So let's say we have letters ACKST. From the structure's root you search for all of those letters in a loop, but after K for instance you may only continue search with A and C (since S and T are above K). Because we're most interested in largest word we should start the search from largest letter (in this case T), and continue it with the next largest letter. To the word CAT we can only arrive searching for letters T,C,A in that specific order. Once we arrive at that A there will be a pointer to list of following words: ACT, CAT.
O(N) Time and O(1) solution to check if 2 strings are anagrams
bool Anagram( const char *s1, const char *s2)
{
unsigned int sum=0;
if ( s1 == NULL || s2 == NULL)
return false;
while ( *s1 != '\0' && s2 != '\0')
{
sum ^= *s1;
sum ^= *s2;
s1++;
s2++;
}
if ( s1 != '\0' || s2 != '\0')
return false;
if (sum) return false;
return true;
}
If you xor two equal numbers..your result is 0. (hence the algorithm)
精彩评论