Best way to export data from PDFs

2022-12-22 11:46 Asked by:

Hi, I work at a newspaper and we are looking for a way to make archive material available. At the moment our pages come in PDF format, so we need a way to export the text and images from the PDFs so that they can be added to a database. We've had a look at the News studio plugin for Adobe Acrobat from Iceni Technology, but I'm wondering if anyone knows of other options for exporting PDF data. Thanks.

There is pdftotext (part of xpdf). It will extract text from PDF files (if it is stored as text in the PDF, and not as an image). You could probably use that.
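
As a rough sketch of how pdftotext could be scripted over an archive, the snippet below wraps the command in Python. The function names and file names are made up for illustration, and it assumes the pdftotext binary is on the PATH. The -layout flag keeps the physical page layout, which may or may not suit multi-column newspaper pages, so it is off by default here.

```python
import shutil
import subprocess


def pdftotext_cmd(pdf_path, txt_path, keep_layout=False):
    """Build the pdftotext command line (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page.
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def extract_text(pdf_path, txt_path, keep_layout=False):
    """Run pdftotext; assumes the tool is installed and on PATH."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext is not installed or not on PATH")
    subprocess.run(pdftotext_cmd(pdf_path, txt_path, keep_layout), check=True)
    return txt_path


# Example call (file names are placeholders):
# extract_text("page01.pdf", "page01.txt")
```

If the resulting text file is empty, the page is most likely stored as an image and needs OCR instead.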

However, be advised that any solution to extract text from PDF will be limited, as PDFs are really for display only. At the very least, you will not have metadata like article date, author etc.; also, if part of the text is in an image, you might lose that.

The better approach is probably to extract the raw data from the system which generates the PDFs, and archive that in a suitable format. Maybe more work, but better results.

If your PDFs already contain the text, then your job will be much easier: tools like pdftotext and pdftohtml will give you image and text output (see the Ubuntu package xpdf-utils).

On the other hand, if the text in your PDF is image-based, then you'll have to look at OCR options. Fortunately, there are some good open source offerings. I have had some success using a combination of ImageMagick and Tesseract:

1. Convert the PDFs to TIFF with ImageMagick (Tesseract won't OCR PDFs).
2. OCR the TIFF using Tesseract (you can also try gocr, also available in the Ubuntu repos).

The key was to make sure the TIFFs were of high enough quality. These ImageMagick settings worked well for me:

convert -depth 8 -density 500 -colorspace GRAY -resize 1600 input.pdf output.tif
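
The two steps could be chained from a script along these lines. This is only a sketch: the helper names are invented, it assumes both convert and tesseract are on the PATH, and the "eng" language code is an assumption you would change for other languages. The convert arguments are the same ones as above.

```python
import shutil
import subprocess
from pathlib import Path


def convert_cmd(pdf_path, tiff_path):
    """ImageMagick command with the settings quoted above."""
    return ["convert", "-depth", "8", "-density", "500",
            "-colorspace", "GRAY", "-resize", "1600",
            str(pdf_path), str(tiff_path)]


def tesseract_cmd(tiff_path, output_base, language="eng"):
    """Tesseract appends '.txt' to output_base on its own."""
    return ["tesseract", str(tiff_path), str(output_base), "-l", language]


def ocr_pdf(pdf_path, work_dir="."):
    """Convert a PDF to TIFF, OCR it, and return the text file path."""
    pdf_path = Path(pdf_path)
    tiff_path = Path(work_dir) / (pdf_path.stem + ".tif")
    output_base = Path(work_dir) / pdf_path.stem
    for cmd in (convert_cmd(pdf_path, tiff_path),
                tesseract_cmd(tiff_path, output_base)):
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(cmd[0] + " is not installed or not on PATH")
        subprocess.run(cmd, check=True)
    return Path(str(output_base) + ".txt")


# Example call (file name is a placeholder):
# ocr_pdf("page01.pdf")
```

At a density of 500 the intermediate TIFFs can get large, so it may be worth deleting them once the text has been written.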

If you need to extract metadata from a PDF as well (Title, Location, Subject, Author, etc.), then pdftk is a useful tool.
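
For example, pdftk's dump_data operation prints the document info as InfoKey / InfoValue line pairs, which are easy to turn into a dictionary before loading into a database. The sketch below assumes that output shape and that pdftk is on the PATH; the function names are made up, and which keys actually appear depends on what the PDF producer wrote.

```python
import shutil
import subprocess


def parse_pdftk_info(dump_text):
    """Collect InfoKey/InfoValue pairs from pdftk dump_data output."""
    info = {}
    key = None
    for line in dump_text.splitlines():
        if line.startswith("InfoKey:"):
            key = line[len("InfoKey:"):].strip()
        elif line.startswith("InfoValue:") and key is not None:
            info[key] = line[len("InfoValue:"):].strip()
            key = None
    return info


def read_pdf_metadata(pdf_path):
    """Run pdftk dump_data; assumes pdftk is installed and on PATH."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk is not installed or not on PATH")
    result = subprocess.run(["pdftk", str(pdf_path), "dump_data"],
                            check=True, capture_output=True, text=True)
    return parse_pdftk_info(result.stdout)


# Example call (file name is a placeholder):
# read_pdf_metadata("page01.pdf").get("Title")
```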
