
What are the best parameters to run ImageMagick to convert low-quality PDFs to images (for OCR)?

I have several low-quality PDFs. I would like to use OCR, to be more precise Ocropus, to get text from them. To do so, I first use ImageMagick, a command-line tool that converts PDFs to images, to transform these PDFs into JPG or PNG.
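For reference, a bare-bones invocation of the kind I mean looks like this (the file names are only placeholders):

convert input.pdf page.png

(With a multi-page PDF, this writes page-0.png, page-1.png, and so on.)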

However, ImageMagick produces very low-quality images and Ocropus hardly recognizes anything. I would like to learn the best parameters for handling low-quality PDFs so that I can feed images of as good a quality as possible to the OCR.

I have found this page, but I do not know where to start.


You can learn about the detailed settings of ImageMagick's "delegates" (the external programs IM uses, such as Ghostscript) by typing

convert -list delegate

(On my system that's a list of 32 different commands.) Now to see which commands are used to convert to PNG, use this:

convert -list delegate | findstr /i png

Ok, this was for Windows. You didn't say which OS you use. [*] If you are on Linux, try this:

convert -list delegate | grep -i png

You'll discover that IM produces PNG only from PS or EPS input. So how does IM get (E)PS from your PDF? Easy:

convert -list delegate | findstr /i PDF
convert -list delegate | grep -i PDF

Ah! It uses Ghostscript to make a PDF => PS conversion, then uses Ghostscript again to make a PS => PNG conversion. That works, but it isn't the most efficient way once you know that Ghostscript can do PDF => PNG in one go. And faster. And in much better quality.

About IM's handling of PDF conversion to images via the Ghostscript delegate you should know two things first and foremost:

  1. By default, if you don't give it an extra parameter, Ghostscript will output images at a resolution of 72 dpi. That's why Karl's answer suggested adding -density 600, which tells Ghostscript to use a 600 dpi resolution for its image output.
  2. The detour of having IM call Ghostscript twice, first for PDF => PS and then for PS => PNG, is a real blunder: you never gain quality in the first step and hardly even keep it, but very often lose some. Reasons:
    • PDF can handle transparencies, which PostScript cannot.
    • PDF can embed TrueType fonts, which PostScript cannot. And so on. (Conversion in the direction PS => PDF is not that critical.) To check which fonts a given PDF actually embeds, see the pdffonts sketch right after this list.
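As an aside (this check is not part of the answer above): the pdffonts utility from poppler-utils lists every font used in a PDF together with its type and embedding status, so you can see whether a given file embeds TrueType fonts at all:

pdffonts /path/to/your/input.pdf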

That's why I'd suggest you convert your PDFs in one go to PNG (or JPEG) using Ghostscript directly. And use the most recent version 8.71 (soon to be released: 9.01) of Ghostscript! Here are example commands:

gswin32c.exe ^
  -sDEVICE=pngalpha ^
  -o output/page_%03d.png ^
  -r600 ^
  d:/path/to/your/input.pdf

(This is the command line for Windows. On Linux, use gs instead of gswin32c.exe, and \ instead of ^ for line continuation.) This command expects to find an output subdirectory, where it will store a separate file for each PDF page. To produce JPEGs of good quality, try

gs \
  -sDEVICE=jpeg \
  -o output/page_%03d.jpeg \
  -r600 \
  -dJPEGQ=95 \
  /path/to/your/input.pdf

(This is the Linux version of the command.) This direct conversion avoids the intermediate PostScript format, which could lose the TrueType font and transparency information that was present in the original PDF file.
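Two practical notes: Ghostscript will not create the output directory for you, so make it first. And, as an assumption on my part rather than part of the answer, for OCR a grayscale device such as pnggray at 300-400 dpi is often sufficient and produces much smaller files than 600 dpi RGBA output:

mkdir -p output
gs \
  -sDEVICE=pnggray \
  -o output/page_%03d.png \
  -r400 \
  /path/to/your/input.pdf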


[*] D'oh! I missed your "linux" tag at first...


-density 600 or so should give you what you need.
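For example (file names are placeholders): note that -density should come before the input file name, so that ImageMagick asks Ghostscript to rasterize the PDF at that resolution instead of upscaling a 72 dpi rendering afterwards:

convert -density 600 input.pdf page.png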


At least two other tools you may want to consider:

  • pdfimages, which comes with the poppler-utils package, makes it easy to extract the images from a PDF without degrading them (example invocations are sketched after this list).
  • pdfsandwich, which can give you an OCR'd file by simply running pdfsandwich inputfile.pdf. You may need to tweak the options to get a decent result. See the official page for more info.
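Example invocations for both (file names are placeholders; the -list and -lang options are taken from the tools' documentation and may differ between versions):

pdfimages -list input.pdf        # list the raster images embedded in the PDF
pdfimages -j input.pdf img       # extract them, keeping JPEG images as JPEG
pdfsandwich -lang eng input.pdf  # OCR the file; -lang selects the Tesseract language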