
Levenshtein distance cost

I am new to Haskell and I ran into a performance issue so severe that it must be my code and not the Haskell platform.

I have a Python implementation of the Levenshtein distance (my own code) and I tried to port it to Haskell. The result is the following:

bool2int :: Bool -> Int
bool2int True = 1
bool2int False = 0

levenshtein :: Eq a => [a] -> [a] -> Int -> Int -> Int
levenshtein u v 0 0 = 0
levenshtein u v i 0 = i
levenshtein u v 0 j = j
levenshtein u v i j = minimum [1 + levenshtein u v i (j - 1),
    1 + levenshtein u v (i - 1) j,
    bool2int (u !! (i - 1) /= v !! (j - 1) ) + levenshtein u v (i - 1) (j - 1) ]

distance :: Eq a => [a] -> [a] -> Int
distance u v = levenshtein u v (length u) (length v)

Now, for strings of length 10 or more, the execution time differs between the Python and Haskell versions by several orders of magnitude. Also, from some rough time measurements (wall clock, as I haven't found a clock() equivalent in Haskell so far) it seems that my Haskell implementation does not cost O(mn), but something that grows exorbitantly fast.

Nota bene: I do not want my Haskell implementation to compete speed-wise with the Python script. I just want it to run in a "sensible" time, not in multiples of the age of the universe.

Questions:

  • What am I doing wrong, that my implementation is so darn slow?
  • How to fix it?
  • Talking about "lazy evaluation": I gather that if levenshtein "cat" "kit" 2 2 is called thrice, it is only calculated once. Is this right?
  • There must be something built-in for my bool2int, right?
  • Any other input is highly appreciated if it shoves me ahead on the rough path to mastering Haskell.

EDIT: Here goes the python code for comparison:

#! /usr/bin/python3.2
# -*- coding: utf-8 -*-

class Levenshtein:
        def __init__ (self, u, v):
                self.__u = ' ' + u
                self.__v = ' ' + v
                self.__D = [ [None for x in self.__u] for x in self.__v]
                for x, _ in enumerate (self.__u): self.__D [0] [x] = x
                for x, _ in enumerate (self.__v): self.__D [x] [0] = x

        @property
        def distance (self):
                return self.__getD (len (self.__v) - 1, len (self.__u) - 1)

        def __getD (self, i, j):
                if self.__D [i] [j] is not None: return self.__D [i] [j]
                self.__D [i] [j] = min ( [self.__getD (i - 1, j - 1) + (0 if self.__v [i] == self.__u [j] else 1),
                                  self.__getD (i, j - 1) + 1,
                                  self.__getD (i - 1, j) + 1] )
                return self.__D [i] [j]

print (Levenshtein ('first string', 'second string').distance)


What am I doing wrong, that my implementation is so darn slow?

Your algorithm has exponential complexity. You seem to be assuming that the calls are being memoized for you, but that's not the case.
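You can see the blow-up by counting the calls the naive recursion makes; the sketch below models your recurrence (the `calls` function is only an illustrative counting model, not part of your code):

```haskell
-- Count the calls made by the naive (unmemoized) Levenshtein recursion.
-- Each interior cell recurses into three subproblems, so the call count
-- grows exponentially in i + j.
calls :: Int -> Int -> Integer
calls 0 _ = 1
calls _ 0 = 1
calls i j = 1 + calls i (j - 1) + calls (i - 1) j + calls (i - 1) (j - 1)

main :: IO ()
main = print (calls 10 10)  -- already in the millions for two length-10 strings
```

With memoization there are only (m + 1) * (n + 1) distinct subproblems, which is where the O(mn) bound comes from.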

How to fix it?

You'll need to add explicit memoization, possibly using an array or some other method.
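As a minimal illustration of the idea (using Fibonacci as a stand-in, since the same pattern applies to the Levenshtein recurrence): store the results in a lazily built structure that the recursion reads back from, so each subproblem is computed at most once.

```haskell
-- Naive version: exponential, because subresults are recomputed.
fibNaive :: Int -> Integer
fibNaive 0 = 0
fibNaive 1 = 1
fibNaive n = fibNaive (n - 1) + fibNaive (n - 2)

-- Memoized version: results live in a shared, lazily evaluated list.
-- Each fibs !! k is computed at most once and then reused.
fibs :: [Integer]
fibs = map fib [0 ..]

fib :: Int -> Integer
fib 0 = 0
fib 1 = 1
fib n = fibs !! (n - 1) + fibs !! (n - 2)
```

A list gives O(n) lookup, so for two-dimensional problems like Levenshtein an array (as in the Data.Array answer below this one) is the better fit; the principle of sharing a lazy table is the same.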

Talking about "lazy evaluation": I gather that if levenshtein "cat" "kit" 2 2 is called thrice, it is only calculated once. Is this right?

No, Haskell does not do automatic memoization. Laziness means that if you do let y = f x in y + y, then the f x will only be evaluated (once) if the result of the sum is demanded. It does not mean that f x + f x will evaluate in only one call to f x. You have to be explicit when you want to share results from subexpressions.
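A small sketch makes the difference visible (the `expensive` function is just an illustrative name; trace output goes to stderr):

```haskell
import Debug.Trace (trace)

expensive :: Int -> Int
expensive x = trace "evaluating expensive" (x * x)

-- Shared: the let-binding is evaluated at most once.
shared :: Int
shared = let y = expensive 5 in y + y

-- Not shared: two syntactically separate calls, typically evaluated
-- twice (though GHC's optimizer may merge them at higher -O levels).
unshared :: Int
unshared = expensive 5 + expensive 5

main :: IO ()
main = do
  print shared    -- one "evaluating expensive" message
  print unshared  -- usually two messages
```

Both expressions evaluate to 50; only the number of evaluations differs. Your `levenshtein u v 2 2` calls are like `unshared`: each call site recomputes the result from scratch.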

There must be something built-in for my bool2int, right?

Yes, there is an instance Enum Bool, so you can use fromEnum.

*Main> fromEnum True
1
*Main> fromEnum False
0

Any other input is highly appreciated if it shoves me ahead on the rough path to mastering haskell.

While writing stuff from scratch may be fun and educational, it is important to learn to take advantage of the many great libraries on Hackage when doing common things like this.

For example there is an implementation of the Levenshtein distance in the edit-distance package.


I translated your Haskell code back to Python for comparison:

def levenshtein(u, v, i, j):
    if i == 0: return j
    if j == 0: return i

    return min(1 + levenshtein(u, v, i, (j-1)),
               1 + levenshtein(u, v, (i-1), j),
               (u[i-1] != v[j-1]) + levenshtein(u, v, (i-1), (j-1)))

def distance(u, v):
    return levenshtein(u, v, len(u), len(v))

if __name__ == "__main__":
    print(distance("abbacabaab", "abaddafaca"))

Even without fixing the O(n) indexing issue that chrisdb pointed out in his answer, this runs slower than the compiled Haskell version:

$ time python levenshtein.py
6

real    0m4.793s
user    0m4.690s
sys 0m0.020s

$ time ./Levenshtein 
6

real    0m0.524s
user    0m0.520s
sys 0m0.000s

Of course, they both lose to the properly memoized version in the edit-distance package:

$ time ./LevenshteinEditDistance 
6

real    0m0.015s
user    0m0.010s
sys 0m0.000s

Here's a simple memoized implementation using Data.Array:

import Data.Array

distance u v = table ! (m, n)
   where table = listArray ((0, 0), (m, n)) [levenshtein i j | i <- [0..m], j <- [0..n]]

         levenshtein 0 j = j
         levenshtein i 0 = i
         levenshtein i j = minimum [ 1 + table!(i, j-1)
                                   , 1 + table!(i-1, j)
                                   , fromEnum (u'!(i-1) /= v'!(j-1)) + table!(i-1, j-1) ]

         u' = listArray (0, m-1) u
         v' = listArray (0, n-1) v

         m = length u
         n = length v

main = print $ distance "abbacabaab" "abaddafaca"

It performs similarly to your original Python code:

$ time python levenshtein-original.py 
6

real    0m0.037s
user    0m0.030s
sys 0m0.000s
$ time ./LevenshteinArray 
6

real    0m0.017s
user    0m0.010s
sys 0m0.000s


It looks to me like the likely cause is the use of !! for random access. Haskell lists are linked lists, which are poorly suited to algorithms that require random rather than sequential access.

You might want to try replacing the lists with something better suited to random access. If you are working with strings you could use Data.Text or ByteStrings, which are backed by arrays and should be fast. Or you could use something like Data.Vector, perhaps.

EDIT: Actually, it looks like Data.Text will have the same issue, since the documentation says indexing is O(n). Probably converting to a vector would be best.
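For completeness, Data.Array (the array package, which ships with GHC) also gives O(1) indexing, which is what the memoized answer above uses for `u'` and `v'`. A minimal sketch (`toArr` is just an illustrative helper name):

```haskell
import Data.Array (Array, listArray, (!))

-- Convert a String (a linked list of Char) into an array,
-- trading the O(n) cost of (!!) for O(1) indexing with (!).
toArr :: String -> Array Int Char
toArr s = listArray (0, length s - 1) s

main :: IO ()
main = do
  let a = toArr "haskell"
  print (a ! 0)  -- 'h'
  print (a ! 4)  -- 'e'
```

Building the array is O(n) once, after which every lookup inside the recurrence is constant time.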
