Is OCR no longer an issue?
According to Wikipedia, "The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem on applications where clear imaging is available such as scanning of printed documents." However, it gives no citation.
My question is: is this true? Is the current state of the art so good that - for a good scan of English text - there aren't any major improvements left to be made?
Or, a less subjective form of this question: how accurate are modern OCR systems at recognising English text in good-quality scans?
I think it is indeed a solved problem. Just have a look at the plethora of OCR technology articles for C#, C++, Java, etc.
Of course, the article does stress that the script needs to be typewritten and clear. That makes recognition a relatively trivial task, whereas if you need to OCR noisy scans or handwriting (where strokes diffuse and vary), it gets trickier because there are more things to tune correctly.
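To give a sense of how routine the clean case has become, a basic recognition pass with the open-source Tesseract engine is only a few lines. This is just a minimal sketch, assuming the pytesseract wrapper and Pillow are installed, a Tesseract binary is on the PATH, and "scan.png" is a hypothetical clean, typewritten page:

    # Minimal sketch: recognise a clean, typewritten English scan with Tesseract.
    # Assumes pytesseract and Pillow are installed and a Tesseract binary is on PATH;
    # "scan.png" is a hypothetical file name.
    from PIL import Image
    import pytesseract

    text = pytesseract.image_to_string(Image.open("scan.png"), lang="eng")
    print(text)

For a good scan of printed English, defaults like these usually suffice; the tuning effort only really starts once the input degrades.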
Considered narrowly as breaking up a sufficiently high-quality 2D bitmap into rectangles, each containing an identified Latin character from one of a set of well-behaved, prespecified fonts (cf. omnifont recognition), it is a solved problem.
Start to play about with those parameters - eccentric or unknown fonts, noisy scans, Asian characters - and it starts to become somewhat flaky or to require additional input. Many well-known omnifont systems do not handle ligatures well.
And the main problem with OCR is making sense of the output. If this were a solved problem, Google Books would give flawless results.