Nearest neighbor search in a Levenshtein-distance-like metric
I have a set of words (a 'dictionary'), and given a new word, I have to find the closest word from the dictionary. (I use 'word' only as shorthand: it is actually a variable-length sequence of abstract 'letters'.)
I am using a generalization of the Levenshtein distance as a metric. The reason I needed to generalize is that I need specific 'costs' for exchanging given pairs of letters: for example, I need the exchange of 'a' with 'b' to cost less than the exchange of 'a' with 'c'. I guess I still have to convince myself that my generalization is still a metric.
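To make this concrete, here is a minimal Python sketch of the kind of distance I have in mind (sub_cost is a hypothetical stand-in for my per-pair cost table, and insertions/deletions get unit cost just to keep the example small). From what I have read, such a distance should be a metric whenever sub_cost is itself a metric on the letters and the insertion/deletion costs are symmetric and positive.

    def sub_cost(a, b):
        # Placeholder: replace with the real per-pair exchange cost table.
        return 0.0 if a == b else 1.0

    def weighted_levenshtein(s, t, ins_cost=1.0, del_cost=1.0):
        # dp[i][j] = cheapest cost of turning s[:i] into t[:j]
        m, n = len(s), len(t)
        dp = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            dp[i][0] = dp[i - 1][0] + del_cost
        for j in range(1, n + 1):
            dp[0][j] = dp[0][j - 1] + ins_cost
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                dp[i][j] = min(
                    dp[i - 1][j] + del_cost,                          # delete s[i-1]
                    dp[i][j - 1] + ins_cost,                          # insert t[j-1]
                    dp[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # substitute
                )
        return dp[m][n]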
Currently I am using the naive linear search, i.e. iterating over all words in the dictionary and keeping track of the smallest distance, and I am looking for a more efficient method.
I started reading about methods for nearest neighbor search, but the main conceptual difficulty for me is that my 'points' (words) are not embedded in a space I can visualize, nor are they fixed-dimensional vectors.
With that in mind, I would like to hear some advice regarding which algorithms to look for.
Let me re-verbalize your question, and then give you a possible answer. Without seeing your data set, I don't know which approach would work better for you.
You already have an algorithm that, given two words, computes a distance between them. It is based on the Levenshtein distance, viewed as the cost of a cheapest edit path between the two words, with a few modifications to the costs. And you want to find the closest word to a given word without having to search the whole dictionary.
The simplest thing I would try is to start with your word and search outward through all possible sets of modifications until you find the closest word in your dictionary. You want a modified breadth-first search (really a best-first search, since the queue is ordered by cost). Store (0, your_word) as the only entry in a http://en.wikipedia.org/wiki/Priority_queue (a heap is easy to implement), take the distance to a random dictionary word as your current best solution, and then, as long as the priority queue is not empty (a Python sketch follows the steps below):
    Take the lowest-cost element (cost, word) out.
    If it is more expensive than your best solution:
        stop, return your best.
    For each possible one-step modification of that word:
        if the new word is in the dictionary and its cost is lower than your best:
            improve your best estimate
        else:
            store (new_cost, new_word) in the priority queue
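Here is a rough Python sketch of those steps, reusing weighted_levenshtein and sub_cost from the snippet in your question. The inputs dictionary (a set of words), alphabet (the set of letters), and max_expansions (a cap on the search space, as suggested below) are my assumptions, not something from your setup:

    import heapq
    import random

    def one_step_modifications(word, cost, alphabet, ins_cost=1.0, del_cost=1.0):
        # Yield (new_cost, new_word) for every word one edit away from word.
        for i in range(len(word)):
            yield cost + del_cost, word[:i] + word[i + 1:]      # deletion
            for c in alphabet:
                if c != word[i]:                                 # substitution
                    yield cost + sub_cost(word[i], c), word[:i] + c + word[i + 1:]
        for i in range(len(word) + 1):
            for c in alphabet:                                   # insertion
                yield cost + ins_cost, word[:i] + c + word[i:]

    def closest_word(word, dictionary, alphabet, max_expansions=100000):
        # Seed the best solution with the distance to a random dictionary word.
        best_word = random.choice(list(dictionary))
        best_cost = weighted_levenshtein(word, best_word)

        heap = [(0.0, word)]   # priority queue of (cost so far, word)
        seen = {word: 0.0}     # cheapest known cost to reach each word
        expansions = 0

        while heap and expansions < max_expansions:
            cost, current = heapq.heappop(heap)
            if cost >= best_cost:                    # everything left is costlier
                break
            if cost > seen.get(current, float("inf")):
                continue                             # stale queue entry
            expansions += 1
            for new_cost, new_word in one_step_modifications(current, cost, alphabet):
                if new_word in dictionary and new_cost < best_cost:
                    best_cost, best_word = new_cost, new_word
                elif new_cost < best_cost and new_cost < seen.get(new_word, float("inf")):
                    seen[new_word] = new_cost
                    heapq.heappush(heap, (new_cost, new_word))
        return best_word, best_cost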
The search set grows exponentially as it expands away from your original word, but if there is a nearby word in the dictionary, it should find it fairly quickly. If you go this route, you may wish to put an upper bound on the search space, after which you give up.
This may be far from an optimal solution, but it shouldn't be too hard to program and try.