How to find that two words differ by how much distance>> Is there any shortest way for this
I have read about Levenshtein distance about the calculation of the distance between the two distinct words.
I have one source string and i have to match it with all 10,000 target words. The closest word should be returned.
The problem is I have given a 开发者_开发问答list of 10,000 target words, and input source words is also huge.... So what shortest and efficient algorithm to apply here. Levenshtein distance calculation for each n every combination(brute force logic) would be very time consuming.
Any hints, or ideas are most welcome.
I guess it depends a little on how the words are structured. For example this guy improved the implementation based on the fact that he processes his words in order and does not repeat calculations for common prefixes. But if all your 10,000 words are totally different that won't do you much good. It's written in python so might be a bit of work involved to port to C.
There are also some kinda homebrew algorithms out there (with which I mean there is no official paper written about it) but that might do the trick.
There's two common approaches for this, and I've blogged about both. The simpler one to implement is BK-Trees - a tree datastructure that speeds lookup based on levenshtein distance by only searching relevant parts of the tree. They'll probably be perfectly sufficient for your use-case.
A more complicated but more efficient approach is Levenshtein Automata. This works by constructing an NFA that recognizes all words within levenshtein distance x of your target string, then iterating through it and the dictionary in lockstep, effectively performing a merge join on them.
精彩评论