
Speeding up Tesseract

I have been using Tesseract (version 3) on Linux to extract text from scanned PDF files. The problem is that the whole process is very slow. For example, extracting the text from this 20-page document (http://www.a-pdf.com/scan-paper/a-pdf-scan-paper-doc.pdf) takes 514 seconds (over 8 minutes).

To convert the PDF I use ImageMagick's convert. Below are the commands I run.

convert -density 288 src.pdf -colorspace Gray -depth 8 -alpha off tmp.tif

tesseract tmp.tif out.txt

Note that 288 dpi is required, since otherwise Tesseract fails completely to extract text from the scanned file that I tested.

Does anyone know how I can speed things up without affecting the quality of the result?


Try VietOCR to see whether it produces results faster. It can accept PDF input if Ghostscript is installed.
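
If you would rather stay with the convert + tesseract pipeline from the question, one thing that may be worth trying is to write one TIFF per page and OCR several pages at the same time. This is only a sketch and has not been timed on the document above, so any gain depends on how many cores you have. The function names `split_pdf` and `ocr_pages`, the `page-%03d.tif` naming and the `OCR_CMD` override are made up for this example and are not part of Tesseract or ImageMagick. It also assumes an xargs that supports `-P` (GNU and BSD xargs do) and a work directory path without quotes or backslashes.

```shell
#!/bin/sh
# Sketch only: split the PDF into one TIFF per page, OCR the pages
# concurrently, then join the per-page text in page order.

# Render each PDF page to its own file (page-000.tif, page-001.tif, ...),
# with the same convert options as in the question.
split_pdf() {
    pdf=$1; workdir=$2
    convert -density 288 "$pdf" -colorspace Gray -depth 8 -alpha off \
        "$workdir/page-%03d.tif"
}

# OCR every page-*.tif in workdir, running up to $jobs processes at once.
# OCR_CMD is a hypothetical override (default: tesseract). It is called as
#   OCR_CMD <image> <output base>
# and is expected to write <output base>.txt, as tesseract does.
ocr_pages() {
    workdir=$1; jobs=${2:-4}
    for img in "$workdir"/page-*.tif; do
        printf '%s\n' "$img"
    done | xargs -P "$jobs" -I{} sh -c '"$1" "$2" "${2%.tif}"' sh "${OCR_CMD:-tesseract}" {}
    # The glob expands in sorted order, so zero-padded names keep page order.
    cat "$workdir"/page-*.txt
}
```

Usage would be something like `mkdir work && split_pdf src.pdf work && ocr_pages work 4 > out.txt`, with 4 replaced by roughly the number of cores available.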
