Finding the set of words that can be spelled with a given set of letters
The problem I'm facing is to create an efficient algorithm that returns the list of words from a fixed word list that can be spelled by using a certain set of letters (this set being the input), with this set having no upper bound (or no useful one at least).
More specifically, this algorithm needs to provide solutions in real time for a constantly augmenting input, though the output can be streamed out slowly as long as the first few output elements come out relatively quickly.
**Note: the input's real augmentation occurs only in the form of adding on to the existing set. When a word is chosen, the LetterStock is depleted o开发者_JS百科f all the letters used and the algorithm is run from scratch again.
The set of possible characters are the 26 standard roman letters. The stock of each letter can be anywhere from 0->infinity (for all intensive purposes) and each letter can only be used once (e.g. a,b,l will not return "ball", though a,bb,d,lll will)
The word list being worked with is constant, so if any pre-calculated hashing solution would be useful to speed up runtime performance, that is an option (though I cannot think of a good way to do this). However, the word list is large (~40,000 elements).
Thus far, the best algorithm I've devised involves iterating through the whole set of words using a copy of the letter stock array and checking each word letter-by-letter, depleting the copy stock and seeing if any letter goes below 0. If so, I move on, if not and I reach the end of the word, I add the word to the solution set. After going through the whole set of words, I start from the beginning again looking for words that I can spell now that 1 or more character have been added to my LetterStock.
// Pass in the string and a copy of the LetterStock 26-entry array
bool canBeSpelled(string str, int *letterStock)
{
for(int i=0; i<str.length; i++)
{
letterStock[str[i]-65]--; // deplete the letter stock for the given letter
if(letterStock[str[i]-65] < 0)
return false;
}
return true;
}
My question thus is: Is this implementation optimal and if not, what is a better one?
Slightly out of the box but have you considered using something like Sqlite for word storage / querying? might solve all your problems
I think a directed acyclic word graph (or DAWG) is what you need. I first used this structure 25 years ago in a Scrabble program (we filched the dictionary from WordPerfect for DOS). Don't give up if you find it difficult to build a DAWG from your list of words -- it's a very satisfying achievement.
Slight optimization but if you know your letter count is like 5 and the number of letters in the word is 6 you don't have to check the word.
How fast do you need your solution to be? Even in the worst case of checking all characters your example only takes about a few milliseconds (2ms in my basic tests) to check all letters in 40,000 words (338,000 letters total). So in one second you could use up a stock of 170 million characters.
I generally prefer a simpler algorithm if it meets my needs than spending time/effort looking for the "best" one that may end up being only marginally better.
A bit of preprocessing might help speed things up for the user.
Have a dictionary in a letters class:
public Dictionary<char, int> count = new Dictionary<char, int> { {'a', 0}, {'b', 0}, etc
Before you let the user at it process your list of words into a dictionary keyed by the word with a value set a letters class:
List<string> strings = new List<string>{ "johnny", "happy", "people" };
Dictionary< string, letters> processedWordList;
void processList()
{
foreach (string s in strings)
{
//there should probably be a constructor in letters which does some of this
string lowered = s.ToLower();
Letters letters;
foreach (char c in lowered)
{
++letters.count[c];
}
processedWordList.Add(s, letters);
}
}
Then you can override the >= operator within the letters class to see if you can spell the word:
public static bool operator >=(LetterCount c1, LetterCount c2)
{
foreach( KeyValuePair<char, int> letter in c1.count )
{
if (letter.Value < c2.count[letter.Key])
return false;
}
return true;
}
And override the operator- in a similar way to remove the letters from your input. Then your main functionality looks like:
foreach( string s in strings )
{
if (input >= processedWordList[s])
{
input = input - processedWordList[s];
}
}
All untested of course.
How big is the set of available characters likely to be? If it's large, I mean if you regularly have, say, 15 or more letters, then a large percentage of the words in your dictionary will likely fit, and a brute force, check each one by one is probably the most efficient algorithm. On the other hand, if your list is mall, if you typically have only 4 or 5 letters in the pool maybe, then most words will not quality and it makes sense to have a way to zoom in on candidate words.
One obvious approach would be to have the dictionary in an indexed table. Take each unique letter from the pool in turn and look for words starting with that letter. For example if the pool contains a,b,b,r,o, look for words starting with a, then b, then r, then o. Beyond that sort of thing, I'm not sure that you'd gain much with more complex data structures or search schemes.
精彩评论