Extracting text from PDF with Poppler (C++)

Posted 2022-12-28 17:30

I'm trying to get my way through Poppler and its (lack of) documentation.

What I want to do is a very simple thing: open a PDF file and read the text in it. I'm then going to process the text, but that doesn't really matter here.

So... I saw the poppler_page_get_text function, and it kind of works, but I have to specify a selection rectangle, which is not very handy. Isn't there just a very simple function that would output the PDF text in order (maybe line by line)?

You should be able to set the selection rectangle to the pageSize/MediaBox of the page and get all the text.

I say "should" because, before you start wondering about surprising output from poppler_page_get_text, you should be aware of how text gets laid out on a page. All graphics are laid out on a page using a program expressed in postfix notation. To render the page, this program is executed on a blank page.

Operations in the program can include changing colors, the position, or the current transformation matrix, and drawing lines, Bézier curves, and so on. Text is laid out by a series of text operators that are always bracketed by BT (begin text) and ET (end text). How or where text is placed on a page is at the sole discretion of the software that generates the PDF. For example, for print drivers, the code responds to GDI calls for DrawString and translates them into text drawing operations.
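To make the postfix idea concrete, here is a minimal sketch of a toy interpreter, not anything Poppler actually ships. It assumes whitespace-separated tokens, strings written as (word) without spaces, and only four operators: BT, ET, Td (move the line position by an offset) and Tj (show a string). The names ShownText and run_text_operators are made up for this illustration, and the toy ignores fonts, glyph widths, and the transformation matrix.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// One string together with the position at which the program showed it.
struct ShownText {
    double x;
    double y;
    std::string text;
};

// Hypothetical toy interpreter: handles only BT, ET, Td and Tj.
// Operands are pushed first and consumed by the operator that follows them,
// which is what "postfix notation" means here.
std::vector<ShownText> run_text_operators(const std::string& program)
{
    std::vector<ShownText> shown;
    std::vector<std::string> operands;
    double x = 0.0, y = 0.0;
    bool in_text = false;

    std::istringstream in(program);
    std::string tok;
    while (in >> tok) {
        if (tok == "BT") {            // begin text: reset the position
            in_text = true;
            x = y = 0.0;
            operands.clear();
        } else if (tok == "ET") {     // end text
            in_text = false;
            operands.clear();
        } else if (tok == "Td") {     // "tx ty Td": move by (tx, ty)
            if (in_text && operands.size() >= 2) {
                y += std::strtod(operands[operands.size() - 1].c_str(), nullptr);
                x += std::strtod(operands[operands.size() - 2].c_str(), nullptr);
            }
            operands.clear();
        } else if (tok == "Tj") {     // "(string) Tj": show a string
            if (in_text && !operands.empty()) {
                std::string s = operands.back();
                if (s.size() >= 2 && s.front() == '(' && s.back() == ')')
                    s = s.substr(1, s.size() - 2);
                shown.push_back({x, y, s});
            }
            operands.clear();
        } else {
            operands.push_back(tok);  // anything else is an operand
        }
    }
    return shown;
}
```

Feeding it "BT 72 700 Td (World) Tj 0 14 Td (Hello) Tj ET" yields "World" before "Hello", even though "Hello" sits higher on the page. The strings come out in program order, not reading order.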

If you are lucky, the text on the page is laid out in a sane order with sane font usage, but many programs that generate PDF aren't so kind. Psroff, for example, liked to place all the plain text first, then the italic text, then the bold text. Words may or may not be placed in reading order. Fonts may be re-encoded so that 'a' maps to '{' or whatever. Then you might have ligatures, where multiple characters are replaced by single glyphs; the most common ones are ae, oe, fi, fl, and ffl.
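As a rough sketch of the cleanup an extractor has to attempt, the code below takes positioned text fragments, groups them into lines by baseline, sorts each line left to right, and expands the f-ligature presentation forms U+FB00 to U+FB04. Everything here is an assumption made for illustration: the Fragment type, the function names, the 2-unit line tolerance, and the idea that the fragments arrive as UTF-8 with known coordinates. It does not handle columns, rotated text, or re-encoded fonts.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical positioned piece of text (UTF-8).
struct Fragment {
    double x;          // left edge
    double y;          // baseline; in PDF user space a larger y is higher up
    std::string text;
};

// Replace the ligatures U+FB00..U+FB04 (UTF-8 bytes EF AC 80..84)
// with the plain letters they stand for.
std::string expand_ligatures(const std::string& utf8)
{
    static const char* const expansions[] = {"ff", "fi", "fl", "ffi", "ffl"};
    std::string out;
    for (std::size_t i = 0; i < utf8.size();) {
        if (i + 2 < utf8.size()
            && static_cast<unsigned char>(utf8[i]) == 0xEF
            && static_cast<unsigned char>(utf8[i + 1]) == 0xAC) {
            const unsigned char last = static_cast<unsigned char>(utf8[i + 2]);
            if (last >= 0x80 && last <= 0x84) {
                out += expansions[last - 0x80];
                i += 3;
                continue;
            }
        }
        out += utf8[i++];
    }
    return out;
}

// Group fragments into lines (baselines within line_tolerance of the first
// fragment of the line), order each line left to right, join with spaces.
std::string in_reading_order(std::vector<Fragment> frags, double line_tolerance = 2.0)
{
    std::stable_sort(frags.begin(), frags.end(),
                     [](const Fragment& a, const Fragment& b) { return a.y > b.y; });
    std::string out;
    std::size_t begin = 0;
    while (begin < frags.size()) {
        std::size_t end = begin + 1;
        while (end < frags.size()
               && std::fabs(frags[end].y - frags[begin].y) <= line_tolerance)
            ++end;
        std::stable_sort(frags.begin() + begin, frags.begin() + end,
                         [](const Fragment& a, const Fragment& b) { return a.x < b.x; });
        for (std::size_t i = begin; i < end; ++i) {
            if (i > begin) out += ' ';
            out += expand_ligatures(frags[i].text);
        }
        out += '\n';
        begin = end;
    }
    return out;
}
```

Grouping by baseline first and sorting by x afterwards avoids a single comparator with a tolerance, which would not be a valid strict ordering for std::sort.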

With all of this in place, the process of extracting text is decidedly non-trivial, so don't be surprised if you see poor quality results from text extraction.

I used to work on the text extraction tools in Acrobat 1.0 and 2.0 - it's a real challenge to get right.

Just for the record, I am using Poppler right now with this little program:

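#include <iostream>
#include <memory>

#include "poppler-document.h"
#include "poppler-page.h"

int main()
{
    // load_from_file() returns a null pointer on failure; unique_ptr frees the document.
    std::unique_ptr<poppler::document> doc(
        poppler::document::load_from_file("./CMI2APIDocV1.4.pdf"));
    if (!doc) {
        std::cerr << "could not open the PDF file" << std::endl;
        return 1;
    }
    const int pagesNbr = doc->pages();
    std::cout << "page count: " << pagesNbr << std::endl;

    for (int i = 0; i < pagesNbr; ++i) {
        // create_page() hands over ownership of a new page object.
        std::unique_ptr<poppler::page> page(doc->create_page(i));
        if (page)  // text() with no argument extracts the whole page
            std::cout << page->text().to_latin1() << std::endl;
    }
}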

// g++ -I/usr/include/poppler/cpp/ -c poppler.cpp
// g++ -I/usr/include/poppler/cpp poppler.o  /usr/lib/x86_64-linux-gnu/libpoppler-cpp.a /usr/lib/x86_64-linux-gnu/libpoppler.a /usr/lib/x86_64-linux-gnu/liblcms2.so     /usr/lib/x86_64-linux-gnu/libfontconfig.a /usr/lib/x86_64-linux-gnu/libjpeg.a /usr/lib/x86_64-linux-gnu/libfreetype.a     /usr/lib/x86_64-linux-gnu/libexpat.a /usr/lib/x86_64-linux-gnu/libz.a

I am quite happy with the result so far, except for arrays and "spreadsheet" content rendered as plain text, where a single cell may sometimes span multiple lines. (Does anyone know how to avoid that?)
