Nearest neighbor search in a Levenshtein-distance-like metric
I have a set of words (a 'dictionary'), and given a new word, I have to find the closest word from the dictionary. (I use 'word' only as shorthand: it is actually a variable-length sequence of abstract 'letters'.)
I am using a generalization of the Levenshtein distance as a metric. The reason I needed to generalize is that I need specific 'costs' for exchanging given pairs of letters: for example, I need the exchange of 'a' with 'b' to cost less than the exchange of 'a' with 'c'. I guess I still have to convince myself that my generalization is still a metric.
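To make this concrete, here is a minimal Python sketch of the kind of distance I have in mind (sub_cost is a hypothetical stand-in for my per-pair cost table, and insertions/deletions get unit cost just to keep the example small). From what I have read, such a distance should be a metric whenever sub_cost is itself a metric on the letters and the insertion/deletion costs are symmetric and positive.

    def sub_cost(a, b):
        # Placeholder: replace with the real per-pair exchange cost table.
        return 0.0 if a == b else 1.0

    def weighted_levenshtein(s, t, ins_cost=1.0, del_cost=1.0):
        # dp[i][j] = cheapest cost of turning s[:i] into t[:j]
        m, n = len(s), len(t)
        dp = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            dp[i][0] = dp[i - 1][0] + del_cost
        for j in range(1, n + 1):
            dp[0][j] = dp[0][j - 1] + ins_cost
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                dp[i][j] = min(
                    dp[i - 1][j] + del_cost,                          # delete s[i-1]
                    dp[i][j - 1] + ins_cost,                          # insert t[j-1]
                    dp[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # substitute
                )
        return dp[m][n]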
Currently I am using the naive linear search, i.e. iterating over all words in the dictionary and keeping track of the smallest distance, and I am looking for a more efficient method.
I started reading about methods for nearest neighbor search, but the main conceptual difficulty for me is that my 'points' (words) are not embedded in a space I can visualize, nor are they fixed-dimensional vectors.
With that in mind, I would like to hear some advice regarding which algorithms to look for.
Let me re-verbalize your question, and then give you a possible answer. Without seeing your data set, I don't know which approach would work better for you.
You already have an algorithm that, given two words, computes a distance between them. It is based on the Levenshtein distance, viewed as the cost of a cheapest edit path between the two words, with a few modifications to the costs. And you want to find the closest word to a given word without having to search the whole dictionary.
The simplest thing I would try is to start with your word and search outward through all possible sets of modifications until you find the closest word in your dictionary. You want a modified breadth-first search (really a best-first search, since the queue is ordered by cost). Store (0, your_word) as the only entry in a http://en.wikipedia.org/wiki/Priority_queue (a heap is easy to implement), take the distance to a random dictionary word as your current best solution, and then, as long as the priority queue is not empty (a Python sketch follows the steps below):
    Take the lowest-cost element (cost, word) out.
    If it is more expensive than your best solution:
        stop, return your best.
    For each possible one-step modification of that word:
        if the new word is in the dictionary and its cost is lower than your best:
            improve your best estimate
        else:
            store (new_cost, new_word) in the priority queue
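Here is a rough Python sketch of those steps, reusing weighted_levenshtein and sub_cost from the snippet in your question. The inputs dictionary (a set of words), alphabet (the set of letters), and max_expansions (a cap on the search space, as suggested below) are my assumptions, not something from your setup:

    import heapq
    import random

    def one_step_modifications(word, cost, alphabet, ins_cost=1.0, del_cost=1.0):
        # Yield (new_cost, new_word) for every word one edit away from word.
        for i in range(len(word)):
            yield cost + del_cost, word[:i] + word[i + 1:]      # deletion
            for c in alphabet:
                if c != word[i]:                                 # substitution
                    yield cost + sub_cost(word[i], c), word[:i] + c + word[i + 1:]
        for i in range(len(word) + 1):
            for c in alphabet:                                   # insertion
                yield cost + ins_cost, word[:i] + c + word[i:]

    def closest_word(word, dictionary, alphabet, max_expansions=100000):
        # Seed the best solution with the distance to a random dictionary word.
        best_word = random.choice(list(dictionary))
        best_cost = weighted_levenshtein(word, best_word)

        heap = [(0.0, word)]   # priority queue of (cost so far, word)
        seen = {word: 0.0}     # cheapest known cost to reach each word
        expansions = 0

        while heap and expansions < max_expansions:
            cost, current = heapq.heappop(heap)
            if cost >= best_cost:                    # everything left is costlier
                break
            if cost > seen.get(current, float("inf")):
                continue                             # stale queue entry
            expansions += 1
            for new_cost, new_word in one_step_modifications(current, cost, alphabet):
                if new_word in dictionary and new_cost < best_cost:
                    best_cost, best_word = new_cost, new_word
                elif new_cost < best_cost and new_cost < seen.get(new_word, float("inf")):
                    seen[new_word] = new_cost
                    heapq.heappush(heap, (new_cost, new_word))
        return best_word, best_cost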
The search set grows exponentially as it expands away from your original word, but if there is a nearby word in the dictionary, it should find it fairly quickly. If you go this route, you may wish to put an upper bound on the search space, after which you give up.
This may be far from an optimal solution, but it shouldn't be too hard to program and try.