A Java Library for text extraction from PDF documents preserving empty spaces and lines [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.

Closed 7 years ago.

Do you know of a Java library that can extract the text of a PDF document as a string while preserving all empty lines and spaces as they appear in the original document?

I am currently using the PDFTextStripper class from the PDFBox-0.7.3 library. Its getText() method returns the document as a string, but it also removes all empty lines, tabs, and runs of spaces between pieces of text. Line breaks are preserved, so I can recognize the structure of the document, but it is important for me to keep the other whitespace as well. This is the default behaviour of getText(), and there seems to be no way to make it preserve the whitespace (I could not find any method in the API for this purpose).

Thank you for your help.


Are you sure there are line feeds, tabs, space characters in the document? Many of the PDFs I've encountered used positioning for spacing and indentation. So rather than include line feeds and tabs, the text object is simply placed further down the page and offset. In that case PDFBox isn't removing anything from the text, the spaces were never there.

If you haven't looked at the PDF source yet, that could be helpful. If it's compressed you can use Multivalent Uncompress to make it readable. The PDF specification describes the text-positioning operators in section 9.4.2.


I had the same problem and solved it by extending the PDFTextStripper class and adding coordinates in front of every line (it was not easy, though). For your problem you could attach coordinates to every word, e.g. by returning not Strings but a List of your own objects (a class holding the word, x and y). You would then be able to reconstruct tabs and multiple spaces from the coordinates afterwards.
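
The reconstruction step could look roughly like the sketch below. It is not the code described above, only an illustration of the idea, and it does not touch PDFBox at all: it assumes the words and their coordinates have already been collected somehow. The names Word, render, charWidth and lineHeight are hypothetical, and the approach assumes a roughly monospaced layout where dividing x by an average character width and y by a line height gives a usable column and row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LayoutSketch {

    /** Hypothetical holder for one extracted word and its position on the page. */
    static final class Word {
        final String text;
        final double x; // left edge of the word, in page units
        final double y; // distance from the top of the page, in page units

        Word(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Places each word on a character grid: column = x / charWidth,
     * row = y / lineHeight. Both divisors are assumptions that would
     * have to be tuned for each document.
     */
    static String render(List<Word> words, double charWidth, double lineHeight) {
        // Group words by grid row; a TreeMap keeps rows in top-to-bottom order.
        Map<Integer, List<Word>> rows = new TreeMap<>();
        for (Word w : words) {
            int row = (int) Math.round(w.y / lineHeight);
            rows.computeIfAbsent(row, k -> new ArrayList<>()).add(w);
        }

        StringBuilder out = new StringBuilder();
        int currentRow = 0;
        for (Map.Entry<Integer, List<Word>> entry : rows.entrySet()) {
            // One newline per row advanced, so skipped rows become blank lines.
            while (currentRow < entry.getKey()) {
                out.append('\n');
                currentRow++;
            }
            List<Word> line = entry.getValue();
            line.sort(Comparator.comparingDouble((Word w) -> w.x));

            StringBuilder sb = new StringBuilder();
            for (Word w : line) {
                int col = (int) Math.round(w.x / charWidth);
                // Keep at least one space between words that would otherwise touch.
                if (sb.length() > 0 && col <= sb.length()) {
                    col = sb.length() + 1;
                }
                // Pad with spaces up to the word's column.
                while (sb.length() < col) {
                    sb.append(' ');
                }
                sb.append(w.text);
            }
            out.append(sb);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates: two columns, with one empty row between the lines.
        List<Word> words = Arrays.asList(
                new Word("Name", 0, 0),
                new Word("Value", 60, 0),
                new Word("total", 0, 24),
                new Word("42", 60, 24));
        System.out.println(render(words, 6.0, 12.0));
    }
}
```

With proportional fonts the column estimate drifts, so in practice the grid size would need tuning, or the gaps between neighbouring words would have to be compared against the actual width of a space in the current font.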

Greetz, GHad


You might want to try our PDFTextStream library. We try very hard to maximize the fidelity of the text extracted by PDFTextStream relative to its displayed presentation, so spacing and such are maintained as much as possible. There are also a couple of optional extraction modes (different implementations of the OutputHandler interface, actually) that allow you to control how the extracted text is formatted, which certainly affects spacing and such.


You might want to take a look at iText. The PdfReader class looks useful.


You can also use JPedal for text extraction. It may well be there are no spaces in the text - remember PDF is a display format...
