OCR: weighted Levenshtein distance

2023-03-07 05:28 问答作者：

I'm trying to create an optical character recognition system with the dictionary.

In fact I don't have an implemented dictionary yet=)

I've heard that there are simple metrics based on Levenstein distance which take in account different distance between different symbols. E.g. 'N' and 'H' are very close to each other and d("THEATRE", "TNEATRE") should be less than d("THEATRE", "TOEATRE") which is impossible using basic Levenstein distance.

Could you 开发者_如何学JAVAhelp me locating such metric, please.

This might be what you are looking for: http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance (and kindly some working code is included in the link)

Update:

http://nlp.stanford.edu/IR-book/html/htmledition/edit-distance-1.html

Here is an example (C#) where weight of "replace character" operation depends on distance between character codes:

      static double WeightedLevenshtein(string b1, string b2) {
        b1 = b1.ToUpper();
        b2 = b2.ToUpper();

        double[,] matrix = new double[b1.Length + 1, b2.Length + 1];

        for (int i = 1; i <= b1.Length; i++) {
            matrix[i, 0] = i;
        }

        for (int i = 1; i <= b2.Length; i++) {
            matrix[0, i] = i;
        }

        for (int i = 1; i <= b1.Length; i++) {
            for (int j = 1; j <= b2.Length; j++) {
                double distance_replace = matrix[(i - 1), (j - 1)];
                if (b1[i - 1] != b2[j - 1]) {
                    // Cost of replace
                    distance_replace += Math.Abs((float)(b1[i - 1]) - b2[j - 1]) / ('Z'-'A');
                }

                // Cost of remove = 1 
                double distance_remove = matrix[(i - 1), j] + 1;
                // Cost of add = 1
                double distance_add = matrix[i, (j - 1)] + 1;

                matrix[i, j] = Math.Min(distance_replace, 
                                    Math.Min(distance_add, distance_remove));
            }
        }

        return matrix[b1.Length, b2.Length] ;
    }

You see how it works here: http://ideone.com/RblFK

A few years too late but the following python package (with which I am NOT affiliated) allows for arbitrary weighting of all the Levenshtein edit operations and ASCII character mappings etc.

https://github.com/infoscout/weighted-levenshtein

pip install weighted-levenshtein

Also this one (also not affiliated):

https://github.com/luozhouyang/python-string-similarity

I've recently created a python package that does exactly that https://github.com/zas97/ocr_weighted_levenshtein.

In my Weigthed-Levenshtein implementation the distance between "THEATRE" and "TNEATRE" is 1.3 while the distance between "THEATRE" and "TOEATRE" is 1.42.

Other exemples are the d("O", "0") is 0.06 and d("e","c") is 0.57.

This distances have been calculated by running multiple ocrs in a synthetic dataset and doing statistics on the most common ocr errors. I hope it helps someone =)

继续阅读：levenshtein-distance metrics ocr

OCR: weighted Levenshtein distance

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？