
Perl CAM::PDF splitting words improperly

I'm using the CAM::PDF Perl module to parse PDFs. The module works great except for one issue: it seems to split words at random. Is there a way to fix this through settings, or an algorithmic way to put the words back together?

For example:

"has offices located in New Yor k and Dublin." (should read "New York")

"price competit ion" (should read "price competition")

The section of code is below:

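    use CAM::PDF;

    # $pdf_name is the path to the PDF; $page is the 1-based page number
    my $pdf  = CAM::PDF->new($pdf_name) or die "$CAM::PDF::errstr\n";
    my $text = $pdf->getPageText($page);
    print "$text\n";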



In general, it is not always possible to reconstruct the original text from a PDF. The way the text is physically stored in the file often doesn't match what you see on the rendered page.

In this case you are quite possibly being affected by manual kerning, i.e. the producer splits the text at certain character pairs and adjusts the spacing between them to produce a more pleasing result (see http://en.wikipedia.org/wiki/Kerning).

The words are therefore broken up and written out in smaller chunks, which CAM::PDF then recognises as separate words.
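To illustrate the idea, here is a minimal sketch of how a kerned text run might be stitched back together. In a PDF content stream the `TJ` operator takes an array that mixes strings with numeric adjustments (in thousandths of a text-space unit; a negative number widens the gap). The sample array, the `join_tj` helper and the threshold values are all made up for illustration and are not part of the CAM::PDF API.

```perl
use strict;
use warnings;

# join_tj is a hypothetical helper, not a CAM::PDF function.
# $items is a list of [type, value] pairs mimicking one TJ array.
# $gap_threshold is an assumed cut-off: adjustments at least this
# negative are treated as a real word gap, smaller ones as kerning.
sub join_tj {
    my ($items, $gap_threshold) = @_;
    my $text = '';
    for my $item (@{$items}) {
        my ($type, $value) = @{$item};
        if ($type eq 'string') {
            $text .= $value;
        }
        elsif ($type eq 'number' && $value <= -$gap_threshold) {
            # Large negative adjustment: assume the producer meant a space.
            $text .= ' ' if $text !~ /\s\z/;
        }
        # Small adjustments are plain kerning and add nothing to the text.
    }
    return $text;
}

# Invented sample: a small kerning tweak between "Yor" and "k".
my @tj = (
    [ string => 'New Yor' ],
    [ number => -15 ],
    [ string => 'k and Dublin.' ],
);

print join_tj(\@tj, 200), "\n";   # threshold above the tweak: "New York and Dublin."
print join_tj(\@tj, 10),  "\n";   # threshold too low: "New Yor k and Dublin."
```

The second call shows how a threshold that is too sensitive could produce exactly the kind of split seen in the question.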

If you have some control over how the PDF is produced, you could experiment with fonts and kerning settings, but this might also compromise output quality.

PDF::OCR2 is likely to handle kerning more robustly and might do a better overall job of recognizing the original text.
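As for an algorithmic way to put the words back together, one possible post-processing fallback is a dictionary check on the extracted text: merge two neighbouring fragments when the joined form is a known word and at least one fragment is not. This is only a rough sketch. `rejoin_words` is a hypothetical helper, and the tiny `%known` hash stands in for a real word list that you would have to supply.

```perl
use strict;
use warnings;

# Strip everything except letters and lowercase, for dictionary lookups.
sub _key {
    my ($s) = @_;
    (my $k = lc $s) =~ s/[^a-z]//g;
    return $k;
}

# rejoin_words is a hypothetical helper: it merges adjacent fragments
# when the merged form is in %$known and the fragments are not both words.
# It only considers pairs, so a word split into three pieces is not repaired.
sub rejoin_words {
    my ($text, $known) = @_;
    my @in = split / /, $text;
    my @out;
    while (@in) {
        my $word = shift @in;
        if (@in) {
            my $merged = $word . $in[0];
            my $both_words = $known->{ _key($word) } && $known->{ _key($in[0]) };
            if ($known->{ _key($merged) } && !$both_words) {
                shift @in;          # consume the second fragment
                $word = $merged;
            }
        }
        push @out, $word;
    }
    return join ' ', @out;
}

# Stand-in word list (an assumption; use a real dictionary in practice).
my %known = map { $_ => 1 }
    qw(has offices located in new york and dublin price competition);

print rejoin_words('has offices located in New Yor k and Dublin.', \%known), "\n";
print rejoin_words('price competit ion', \%known), "\n";
```

A heuristic like this will miss proper nouns and jargon that are absent from the word list, so it is best treated as a cleanup pass rather than a fix.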
