OCR error correction: How to combine three erroneous results to reduce errors

2023-01-16 03:49 Q&A author:

The problem

I am trying to improve the result of an OCR process by combining the output of three different OCR systems (Tesseract, Cuneiform, Ocrad). I already do image preprocessing (deskewing, despeckling, thresholding and some more), and I don't think that part can be improved much further. The text to recognize is usually between one and six words long. The language of the text is unknown, and quite often it contains fantasy words. I am on Linux. The preferred language would be Python.

What I have so far

Often every result has one or two errors, but the errors occur at different characters/positions. An engine may recognize a wrong character or include a character that does not exist in the text. Less often, it drops a character.

An example might look like this:

Xorem_ipsum
lorXYm_ipsum
lorem_ipuX

An X is a wrongly recognized character and a Y is a character that does not exist in the text. Spaces are replaced by "_" for better readability.

In cases like this I try to combine the different results. By repeatedly applying the "longest common substring" algorithm to the three pairs, I am able to get the following structure for the given example:

or m_ipsum
lor m_ip u
orem_ip u

But this is where I am stuck: I am not able to combine those pieces into a single result.
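For reference, the pairwise structure shown above can be reproduced with Python's standard `difflib` module. This is only a sketch: it assumes that `SequenceMatcher`'s matching blocks (which are found by recursively taking the longest matching block) are an acceptable stand-in for the repeated "longest common substring" step, and the helper name `common_pieces` is made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def common_pieces(x, y):
    """Return the substrings shared by x and y, in order of appearance."""
    # autojunk=False switches off difflib's heuristic for very long inputs.
    sm = SequenceMatcher(None, x, y, autojunk=False)
    # The final matching block is a zero-length sentinel, so skip empty ones.
    return [x[m.a:m.a + m.size] for m in sm.get_matching_blocks() if m.size]


results = ["Xorem_ipsum", "lorXYm_ipsum", "lorem_ipuX"]
for x, y in combinations(results, 2):
    print(common_pieces(x, y))
# ['or', 'm_ipsum']
# ['orem_ip', 'u']
# ['lor', 'm_ip', 'u']
```

The three lists correspond to the three lines of the structure above, although the pairs are visited in a different order.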

The questions

Do you have

an idea how to combine the different common longest substrings?
Or do you have a better idea how to solve this problem?

The quality of the results you can expect depends on the OCR engines you are using. A higher-quality OCR engine that reports confidence levels and bounding boxes would likely give you much better raw results in the first place, plus extra information that could be used to determine the correct result.

Using Linux restricts the OCR engines available to you. Personally, I would rate Tesseract 6.5/10 compared to the commercial OCR engines available under Windows.

http://www.abbyy.com/ocr_sdk_linux/overview/ - The SDK may not be cheap though.

http://irislinktest.iriscorporate.com/c2-1637-189/iDRS-14-------Recognition--Image-preprocessing--Document-formatting-and-more.aspx - Available for Linux

http://www.rerecognition.com/ - Is available as a Linux version. This engine is used by many other companies.

All of the engines above should give you confidence levels, bounding boxes and better results than Tesseract OCR.
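If the engines do report confidence levels, one way to use them is to weight each engine's reading by its confidence instead of counting every reading equally. The sketch below is an assumption-laden illustration: the function name `weighted_vote`, the 0.0 to 1.0 confidence scale, and all the numbers are invented, and how each engine actually exposes its confidences is not covered here.

```python
from collections import defaultdict


def weighted_vote(readings):
    """Pick the reading with the highest summed confidence.

    `readings` is a list of (text, confidence) pairs, one per engine.
    The 0.0-1.0 scale is an assumed convention, not a specific engine's API.
    """
    totals = defaultdict(float)
    for text, confidence in readings:
        totals[text] += confidence
    # On a tie, the reading that was seen first is kept.
    return max(totals, key=totals.get)


# Made-up confidences for one disputed position in the text.
print(weighted_vote([("e", 0.55), ("XY", 0.30), ("e", 0.60)]))  # e
print(weighted_vote([("l", 0.40), ("I", 0.95), ("1", 0.35)]))   # I
```

In the second call a single very confident engine overrules the other two, which a plain majority vote could not do.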

https://launchpad.net/cuneiform-linux - Cuneiform, now open source and running under Linux. This is likely one of the three engines you are using. If not, you should probably look at adding it.

Also you may want to look at http://tev.fbk.eu/OCR/Products.html for more options.

Can you paste a sample or two of typical images and the OCR results from the engines? There are other ways to improve OCR recognition, but they would depend on the images.

Maybe repeat the "longest common substring" step until all results are the same. For your example, you would get the following in the next step:

or m_ip u
or m_ip u
or m_ip u

Or run the "longest common substring" algorithm on the first and second string, and then again on that result and the third string. That gets you the same result, or m_ip u, more easily.

So you can assume that those letters are correct. Now look at the gaps. Before or there is l twice and X once, so choose l. Between or and m_ip there is e twice and XY once, so choose e. And so on.
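The gap-voting idea can be sketched in Python roughly as follows. This is a minimal illustration, not a tuned implementation: it assumes `difflib.SequenceMatcher` is a reasonable substitute for the "longest common substring" step, the names `matched_pairs` and `combine_three` are hypothetical, and on a three-way disagreement it simply keeps the first engine's reading.

```python
from collections import Counter
from difflib import SequenceMatcher


def matched_pairs(x, y):
    """Index pairs (i, j) with x[i] == y[j], taken from difflib's matching blocks."""
    sm = SequenceMatcher(None, x, y, autojunk=False)
    return [(m.a + k, m.b + k)
            for m in sm.get_matching_blocks()
            for k in range(m.size)]


def combine_three(a, b, c):
    """Keep characters all three strings agree on and vote on the gaps between them."""
    ab = matched_pairs(a, b)
    skeleton = "".join(a[i] for i, _ in ab)  # characters common to a and b
    # Anchors: positions (i, j, k) of characters present in all three strings.
    anchors = [(ab[p][0], ab[p][1], k) for p, k in matched_pairs(skeleton, c)]
    anchors.append((len(a), len(b), len(c)))  # sentinel that closes the last gap

    out = []
    pa = pb = pc = 0
    for i, j, k in anchors:
        gaps = [a[pa:i], b[pb:j], c[pc:k]]
        # Majority wins; on a three-way tie the first engine's reading is kept.
        out.append(Counter(gaps).most_common(1)[0][0])
        if i < len(a):  # real anchor, not the sentinel
            out.append(a[i])
        pa, pb, pc = i + 1, j + 1, k + 1
    return "".join(out)


print(combine_three("Xorem_ipsum", "lorXYm_ipsum", "lorem_ipuX"))  # lorem_ipsum
```

An empty gap counts as a vote too, so a character that only one engine inserted (the XY case) is outvoted, and a character that only one engine dropped (the missing s in the third result) is restored by the other two.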

I'm new to OCR, but so far I have found that those systems are built to work from a dictionary of words rather than letter by letter. So, if your images don't contain real words, you may have to look more closely at the letter recognition and training parts of the systems you are using.

I faced a very similar problem. I hope this helps: http://dl.tufts.edu/catalog/tufts:PB.001.011.00001

See also software developed by Bruce Robertson: https://github.com/brobertson/rigaudon
