Introduction to OCR
Someone gave me a trove of amazing information: 200MB of .tiff images of scanned announcements going back to the 1940s. I want to digitize it, but I have no knowledge whatsoever of OCR. Some of the early material is barely readable by a human, let alone a machine. It is also in Hebrew.
I'm looking for advice on how to approach this: good suggestions for books, articles, code libraries or software (all of which should be freely available on the web). I'm proficient in C++ and Python and can pick up another language if needed.
Thank you.
This sounds like a great task for Python, using an OCR library. A quick Google search turned up pytesser:
PyTesser is an Optical Character Recognition module for Python. It takes as input an image or image file and outputs a string.
PyTesser uses the Tesseract OCR engine, converting images to an accepted format and calling the Tesseract executable as an external script. A Windows executable is provided along with the Python scripts. The scripts should work in other operating systems as well.
...
Usage Example
>>> from pytesser import *
>>> image = Image.open('fnord.tif')  # Open image object using PIL
>>> print image_to_string(image)     # Run tesseract.exe on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord
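Since the scans are in Hebrew, it may help to see roughly what "calling the Tesseract executable as an external script" looks like when you do it yourself and pass a language. This is only a sketch, not PyTesser's actual code: the function names `build_tesseract_command` and `ocr_file` are made up for illustration, and it assumes a `tesseract` binary is on your PATH with the Hebrew (`heb`) language data installed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="heb"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>.

    Hypothetical helper for illustration; "heb" is Tesseract's code for Hebrew.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_file(image_path, lang="heb"):
    """Run the tesseract executable on one image and return the recognized text.

    Assumes tesseract is installed and the requested language data is available.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        output_base = Path(tmp) / "page"
        command = build_tesseract_command(image_path, output_base, lang)
        # check=True raises CalledProcessError if tesseract exits with an error.
        subprocess.run(command, check=True)
        # Tesseract appends ".txt" to the output base name.
        return (Path(tmp) / "page.txt").read_text(encoding="utf-8")
```

With that in place, something like `text = ocr_file('fnord.tif')` should give you the recognized text for a single scan, and looping over a directory of .tiff files is straightforward from there. How well it copes with the barely readable early material is something you would have to try.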