Preserve "long" spaces in PDFBox text extraction

2023-02-03 09:25 问答作者：

I am using PDFBox to extract text from PDF. The PDF has a tabular structure, which is quite simple and columns are also very widely spaced from each-other

This works really well, except that all kinds of horizontal space gets converted into a single space character, so that I cannot tell columns apart anymore (space within words in a column looks just like space between columns).

I appreciate that a general solution is very hard, but in this case the columns are really far apart so that having a simple differentiation between "long spaces" and "space between words" would be enough.

Is there a way to tell PDFBox to turn horizontal whitespace of more then x inches into something other than a single space? A proportional approach (x inch become y spaces) would also work.

The pdftotext C library/tool has a '-layout' switch that tries to preserve the layout. Basically, if I can emulate that with开发者_JAVA百科 PDFBox, that would be perfect.

There does not seem to be a setting for this, but I was able to modify the source for the PDFTextStripper tool to output a column separator (|) when a "long" space was encountered. In the code where it was building the output line it is possible to look at the x positions of the current and previous letter, and if it is large enough, do something special. PDFTextStripper has lots of protected methods, but turned out to be not really all that extensible. I ended up having to copy the whole class to change a private method.

Looking at the code in there, I call myself lucky that with the particular PDF, this simple approach was successful. A more general solution seems very tricky.

PDF text extraction is difficult.

If the text was output as one big string separated by spaces such as :-

PDFTextOut("     Column 1                    Column 2           Column 3");

and you are using a fixed width font such as Courier then you could theoretically calculate the number of spaces between items of text because each character is the same width. If the font is proportional such a Arial then the calculation is harder.

In reality most PDF's generated by individually placing each piece of text directly into its position. Therefore, there is technically no space character or any other characters between columns. The text is just placed into an absolute position on the page.

PDFMoveTo(100,100);
PDFTextOut("Column 1");
PDFMoveTo(250,100);
PDFTextOut("Column 2");

In order to perform data extraction on PDF documents you have to do a little bit more work to find and match column data by using pixel locations as you have mentioned and by making some assumptions and having a little bit of luck.

继续阅读：pdf pdfbox text-extraction whitespace

Preserve "long" spaces in PDFBox text extraction

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？