How to remove a character in a PDF using pdflib

I want to remove a hidden space in a PDF using PDF lib.

When I extract the word "Gregor" from the PDF, it comes out as "Gre gor", but I really want it to be "Gregor".

What could be the reasons for this? Or, how can I avoid these "hidden spaces"?


Many years ago, I worked at Adobe on Acrobat, version 1.0 and later. At the time, I wrote the tools to do searching, highlighting, and copy/paste. I'll try to explain why you're probably seeing what you're seeing and why you're probably also SOL (unless you want to hack PDF lib).

In PDF, page contents are represented by a program in an RPN language that is similar to PostScript. It differs in that it is not Turing complete: it lacks loops, reasonable function definition, recursion, etc., thus sidestepping that pesky halting problem. A typical page content program looks something like this:

1 0 0 rg 72 72 m 144 72 l 144 144 l 72 144 l f

which means: set the fill color to red (RGB 1 0 0; PDF color components run from 0 to 1), move to (72, 72), connect a line to (144, 72), etc., and finally fill the path. This creates a red square, one inch on a side, with its lower left corner one inch up from the bottom edge and one inch in from the left edge of the page.
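
To make the RPN structure concrete, here is a rough Python sketch of how a reader might split such a stream into operators and their operands: operands pile up until an operator arrives and claims them. The tokenizer is a toy written for the simple examples in this answer (no names, arrays, hex strings, or escaped parentheses), and parse_ops is a made-up helper name, not anything from PDF lib or Acrobat.

```python
import re

# Toy tokenizer: (string) literals without nested or escaped parentheses,
# plain numbers, and bare operator names. Real content streams need far more.
TOKEN = re.compile(
    r"\((?P<str>[^()]*)\)"        # (string) literal
    r"|(?P<num>[-+]?\d*\.?\d+)"   # integer or real number
    r"|(?P<op>[A-Za-z'\"*]+)"     # operator such as m, l, f, Tj, T*, ' or "
)

def parse_ops(stream):
    """Return a list of (operator, operands) pairs in stream order."""
    ops, pending = [], []
    for match in TOKEN.finditer(stream):
        if match.group("op") is not None:
            # An operator consumes every operand collected since the last one.
            ops.append((match.group("op"), pending))
            pending = []
        elif match.group("num") is not None:
            pending.append(float(match.group("num")))
        else:
            pending.append(match.group("str"))
    return ops

# The path part of the square example: one move, three lines, one fill.
for operator, operands in parse_ops("72 72 m 144 72 l 144 144 l 72 144 l f"):
    print(operator, operands)
```

Even this much shows why extraction is awkward: the stream is a list of drawing instructions, and nothing in it says "this is a word".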

Now, when you are working with text, it's a little more complicated. There are four operators to draw text, Tj, ', " and TJ. They mostly differ in how they affect placement of text either before or after the operator is applied. Nonetheless, in a sane world, you would expect your document to have something like this in the content stream:

BT 72 288 Td (Gregor) Tj ET

which means begin text, move the text position to (72, 288), place the text "Gregor", and end text.

Likely, this is not the case. Instead, your document probably looks more like this:

BT 72 288 Td (Gre) Tj --stuff-- 16 0 Td (gor) Tj ET

where --stuff-- is zero or more other PDF operators. (Td offsets are relative to the start of the current line, so 16 0 Td moves 16 units to the right of where "Gre" started.) PDF is a page description language, not a text file format. Therefore, PDF doesn't dictate how you should lay out the content stream for creating a page. In fact, there are an infinite number of ways to generate equivalent/identical pages.

So, the author of any chunk of code that purports to extract text from a PDF document, should take some time to very clearly answer the question, "what is a word?" If that isn't answered well first, then you're never going to have any kind of reasonable text extraction. While I don't know specifically, I highly suspect that pdflib's definition of a word is "any white space delimited substring from a text placement operator." This definition will get you maybe 80% of the way there. Maybe more, but not much. It is a nearly trivial definition to implement, but it will fail if words are not laid down with single text placement operators. Heck, there are even PDF pages where the text isn't laid down anywhere close to reading order. For example, troff (at least used to) lay out all the plain text first, then the italic text, then the bold text.
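
As an illustration of that first definition, here is a minimal Python sketch of the kind of extractor I suspect is at work. To be clear, this is my guess at the approach, not PDF lib's actual code: naive_words is a hypothetical name, and the pattern only handles the Tj operator with simple strings.

```python
import re

# Matches a (string) operand immediately followed by the Tj operator.
# Toy pattern: no escaped or nested parentheses, and no ', " or TJ handling.
SHOW_TEXT = re.compile(r"\(([^()]*)\)\s*Tj")

def naive_words(stream):
    """Treat each whitespace-delimited piece of each Tj string as a word."""
    words = []
    for text in SHOW_TEXT.findall(stream):
        words.extend(text.split())
    return words

# One Tj for the whole word: works as hoped.
print(naive_words("BT 72 288 Td (Gregor) Tj ET"))            # ['Gregor']

# Two back-to-back Tj operators render the same "Gregor" on the page,
# but the naive definition sees two separate words.
print(naive_words("BT 72 288 Td (Gre) Tj (gor) Tj ET"))      # ['Gre', 'gor']
```

Join the second result with spaces and you get exactly the "Gre gor" from the question.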

Then you have to think about the problem in a different way. What if you define a word to be an ordered sequence of glyphs that are close to each other in physical space and similar in size? Then you find that definition would completely encompass the previous definition's success cases and also correctly include a huge number of the previous failures that are inherent in the previous "what is a word" definition. You also find that the actual implementation of that definition in code is significantly more difficult. While the first definition can be done in about an hour's time, this definition is more like weeks or months of time to really get right, because you have to answer the questions "what is close?" and "what is similar in size?" And while you're at it, you need to consider other things like text encoding, ligatures, discretionary hyphens, text laid along a curve (I can't tell you how happy I was when Acrobat was capable of finding words in maps).
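
A very rough Python sketch of the geometric definition follows. It assumes you have already recovered each text fragment's position, advance width, and font size from the content stream and font metrics, which is a large part of the real work. The Fragment fields, the group_words name, and both thresholds are assumptions I made up for illustration; the numbers in the example are invented too.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    text: str
    x: float      # left edge, in points
    y: float      # baseline, in points
    width: float  # advance width, in points
    size: float   # font size, in points

def belongs_to_same_word(prev, frag, gap_ratio, size_ratio):
    """Decide "close" and "similar in size" relative to the previous font size."""
    gap = frag.x - (prev.x + prev.width)
    same_line = abs(frag.y - prev.y) <= size_ratio * prev.size
    similar_size = abs(frag.size - prev.size) <= size_ratio * prev.size
    close = abs(gap) <= gap_ratio * prev.size
    return same_line and similar_size and close

def group_words(fragments, gap_ratio=0.2, size_ratio=0.1):
    """Merge fragments into words, top to bottom, then left to right."""
    words = []
    prev = None
    # Sorting by position means stream order no longer matters. Caveat: fragments
    # whose baselines differ only slightly may sort out of reading order here.
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if prev is not None and belongs_to_same_word(prev, frag, gap_ratio, size_ratio):
            words[-1] += frag.text
        else:
            words.append(frag.text)
        prev = frag
    return words

fragments = [
    Fragment("Samsa", 107.5, 288, 30, 12),  # 4.5 pt after "gor": a new word
    Fragment("gor", 88, 288, 15, 12),       # starts exactly where "Gre" ends
    Fragment("Gre", 72, 288, 16, 12),
]
print(group_words(fragments))  # ['Gregor', 'Samsa']
```

This handles only horizontal, left-to-right text with whitespace-free fragments. Rotated text, text along a curve, ligatures, hyphenation, and encodings are where the weeks or months go.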

So the conclusion you should draw from this is that extracting text from PDF is non-trivial and you should expect a great number of failures from trivially written code.


Read the PDF line by line and replace "Gre gor" with "Gregor".


I highly recommend you look at PdfTextStream. They have done the hard work described in plinth's post (the long answer above).

http://www.snowtide.com/

They aim to have the most natural (what a human reader would expect a word to be) definition of a word.
