How to modify Levenshteins Edit Distance to count "adjacent letter exchanges" as 1 edit
I'm playing around with Levenshteins Edit Distance algorithm, and I want to extend this to count transpositions -- that is, exchanges of adjacent letters -- as 1 edit. The unmodified algorithm counts insertions, deletes or substitutions 开发者_如何学编程needed to reach a certain string from another. For instance, the edit distance from "KITTEN" to "SITTING" is 3. Here's the explanation from Wikipedia:
- kitten → sitten (substitution of 'k' with 's')
- sitten → sittin (substitution of 'e' with 'i')
- sittin → sitting (insert 'g' at the end).
Following the same method, the edit distance from "CHIAR" to "CHAIR" is 2:
- CHIAR → CHAR (delete 'I')
- CHAR → CHAIR (insert 'I')
I would like to count this as "1 edit", since I only exchange two adjacent letters. How would I go about to do this?
You need one more case in the algorithm from Wikipedia:
if s[i] = t[j] then
d[i, j] := d[i-1, j-1]
else if i > 0 and j > 0 and s[i] = t[j - 1] and s[i - 1] = t[j] then
d[i, j] := minimum
(
d[i-2, j-2] + 1 // transpose
d[i-1, j] + 1, // deletion
d[i, j-1] + 1, // insertion
d[i-1, j-1] + 1 // substitution
)
else
d[i, j] := minimum
(
d[i-1, j] + 1, // deletion
d[i, j-1] + 1, // insertion
d[i-1, j-1] + 1 // substitution
)
You have to modify how you update the dynamic programming table. In the original algorithm one considers the tails(or heads) of the two words that differ at the most by length one. The update is the minimum of all such possibilities.
If you want to modify the algorithm such that changes in two adjacent locations count as one, the minimum above has to be computed over tails(or heads) that differ by at most two. You can extend this to larger neighborhoods but the complexity will increase exponentially in the size of that neighborhood.
You can generalize further and assign costs that depend on the character(s) deleted, inserted or substituted, but you have to make sure that the cost you assign to a pair-edit is lower than two single edits, otherwise the two single edits will always win.
Let the words be w1 and w2
dist(i,j) = min(
dist(i-2,j-2) && w1(i-1,i) == w2(j-1,j) else
dist(i-1,j-1) && w1(i) == w2(j) else
dist(i,j-1) + cost(w2(j)),
dist(i-1,j) + cost(w1(i)),
dist(i-1,j-1) + cost(w1(i), w2(j)),
dist(i, j-2) + cost(w2(j-1,j)),
dist(i-2, j) + cost(w1(i-1,i)),
dist(i-2,j-2) + cost(w1(i-1,i), w2(j-1,j))
)
What I mean by the &&
is that those lines should be considered only if the conditions are satisfied.
The other answers are implementing the Optimal String Alignment algorithm, not Damerau Levenshtein which I think is what you are describing.
I have a java implementation of OSA with some optimizations here: https://gist.github.com/steveash/5426191
精彩评论