Queryable Language Dictionary and Word Searching Functionality
In a future project I will need to implement functionality meant for searching words (either by length or given a set of characters and their position in the word) which will return all words that meet a certain criteria.
In order to do so, I will need language dictionaries that can be easily queryable in LINQ. The first thing I'd like to ask is if anyone knows about good dictionaries to use in t开发者_开发百科his kind of application and environment used.
And I'd also like to ask about good ways to search the said dictionary for a word. Would a hash table help speeding up queries? The thing is that a language dictionary can be quite huge, and knowing I will have plenty of search criteria, what would be a good way to implement such functionality in order to avoid hindering the search speed?
Without knowing the exact set of stuff you are likely to need to optimize for, it's hard to say. The standard data structures for efficiently organizing a large corpus of words for fast retrieval is the "trie" data structure, or, if space efficiency is important (because say you're writing a program for a phone, or other memory-constrained environment) then a DAWG -- a Directed Acyclic Word Graph. (A DAWG is essentially a trie that merges common paths to leaves.)
Other interesting questions that I'd want to know an answer to before designing a data structure are things like: will the dictionary ever change? If it does change, are there performance constraints on how fast the new data needs to be integrated into the structure? Will the structure be used only as a fast lookup device, or would you like to store summary information about the words in it as well? (If the latter then a DAWG is unsuitable, since two words may share the same prefix and suffix nodes.) And so on.
I would search the literature for information on tries, DAWGs and ways to optimize Scrabble programs; clearly Scrabble requires all kinds of clever searching of a corpus of strings, and as a result there have been some very fast variants on DAWG data structures built by Scrabble enthusiasts.
I have recently written an immutable trie data structure in C# which I'm planning on blogging about at some point. I'll update this answer in the coming months if I do end up doing that.
精彩评论