Why is the highlighting of text so funky in PDFs? [closed]

Closed. This question is off-topic. It is not currently accepting answers.

Closed 11 years ago.

This has happened to me a million times and I'm finally getting around to figuring it out. So many times when I'm highlighting a row of text in a PDF, the highlight randomly jumps around, skips lines, and skips letters in the middle. For example: "this is" (highlighted) "this isn't" (not highlighted) "this is again" (highlighted), even though they are all on the same line. What gives?


It depends on how the PDF was generated. Some programs write out full lines of text, and these are easy to highlight because the PDF viewer (Acrobat, etc.) knows that the text is linear. Other programs write out each glyph (letter) one by one. Basically "draw letter a at position (100,100), letter b at (110,99.99)". In this example the letter b is 10 units further along in the x direction and almost exactly the same in the y direction. Almost. Visually the two letters sit on the same baseline. Mathematically, the viewer has to guess whether they are on the same "line" or "next to each other". Sometimes it gets it right, sometimes it doesn't.
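
To make that guessing concrete, here is a rough sketch in Python of how a viewer might decide which glyphs share a line. The `Glyph` class, the `group_into_lines` function and the `y_tolerance` value are all made up for illustration; real viewers use more elaborate heuristics that I can't speak to.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float
    y: float


def group_into_lines(glyphs, y_tolerance=0.5):
    """Group glyphs whose y coordinates differ by at most y_tolerance.

    Hypothetical heuristic: each glyph joins the first existing line whose
    first glyph is close enough in y, otherwise it starts a new line.
    """
    lines = []
    for glyph in glyphs:
        for line in lines:
            if abs(line[0].y - glyph.y) <= y_tolerance:
                line.append(glyph)
                break
        else:
            # No existing line was close enough, so start a new one.
            lines.append([glyph])
    return lines


glyphs = [Glyph("a", 100, 100), Glyph("b", 110, 99.99), Glyph("c", 100, 80)]

# A forgiving tolerance puts "a" and "b" on one line.
loose = group_into_lines(glyphs, y_tolerance=0.5)

# A very strict tolerance splits them, which is when selection jumps around.
strict = group_into_lines(glyphs, y_tolerance=0.001)
```

With the forgiving tolerance, "a" and "b" end up on the same line. With the strict one, the 0.01 difference in y is enough to put them on separate lines, and that is the kind of wrong guess that makes a highlight skip letters.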

Why do programs write things out letter by letter? When advanced formatting is used (letter spacing, kerning, ligatures, etc.), some design programs decide to write out how something should look rather than what it actually is. A graphic designer generally doesn't care whether two letters are physically next to each other in the file; they only care whether the letters look like they are.

Why do some programs mess up the y coordinate when writing letters next to each other? Remember, they're trying to explain how text should look, not what the actual text is. All fonts have different heights and some design programs might adjust the positioning (just slightly) so that text visually falls in line better.

Lastly, there's no guarantee that text is written linearly within the file, left to right or top to bottom. Some programs might write line 1, then line 3 and then line 2. It looks okay when displayed, but it's not in that order in the file. Why do they do this? Who knows. Maybe line two was indented a bit (or used a letter that caused the position of the ink to be slightly indented), so a left-to-right scan didn't catch it right away.
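
Here's a small Python sketch of what a viewer has to do to undo that. It takes text runs in file order and tries to recover reading order from positions alone. The `(text, x, y)` tuples, the `reading_order` function and the `y_tolerance` default are assumptions for the example, not anything taken from a real viewer. It does rely on PDF's default coordinate system, where y grows upward, so a larger y is higher on the page.

```python
def reading_order(runs, y_tolerance=2.0):
    """Return lines of text in top-to-bottom, left-to-right order.

    runs: list of (text, x, y) tuples in whatever order the file stores them.
    Runs whose y is within y_tolerance of the first run on the current
    line are treated as belonging to that line.
    """
    # Top of the page first: larger y means higher up in default PDF space.
    by_height = sorted(runs, key=lambda run: -run[2])

    lines = []
    for run in by_height:
        if lines and abs(lines[-1][0][2] - run[2]) <= y_tolerance:
            lines[-1].append(run)
        else:
            lines.append([run])

    # Within each line, read left to right.
    return [
        " ".join(text for text, _, _ in sorted(line, key=lambda run: run[1]))
        for line in lines
    ]


# Stored as line 1, line 3, line 2, with line 2 slightly indented.
stored = [("line 1", 72, 700), ("line 3", 72, 672), ("line 2", 80, 686)]
recovered = reading_order(stored)

# Two pieces of one line, stored right to left with a tiny y offset.
pieces = [("world", 120, 700.1), ("hello", 72, 700)]
joined = reading_order(pieces)
```

This works for a single column. With multiple columns, sorting by y alone would interleave lines from different columns.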

Hopefully that helps a bit!


It has to do with how text is represented in PDF files. The PDF format doesn't really have the concept of lines of text (at least in its basic form); it just puts letters (or rather glyphs) on the page at specific positions.

The application that displays the PDF often has to guess the order in which the text is supposed to be read. This can be harder than it sounds for complex multi-column layouts, pull quotes, etc. and it can even be difficult for "regular" text if there are footnotes or fonts with different metrics on the same line.

Some PDFs also represent umlauts and accented characters as multiple glyphs (e.g. a ¨ on top of an "a") in which case it can be difficult to determine to which line the character belongs.
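
As a rough illustration of the accent problem, this Python sketch folds a separately drawn accent back into the letter it sits on. The `merge_accents` function, the `SPACING_TO_COMBINING` table and the `x_tolerance` value are hypothetical, and the sketch assumes the accent is stored immediately after its base letter, which a real file does not guarantee.

```python
import unicodedata

# Standalone accent characters mapped to their Unicode combining forms.
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # diaeresis (umlaut dots)
    "\u00b4": "\u0301",  # acute accent
    "`": "\u0300",       # grave accent
}


def merge_accents(glyphs, x_tolerance=1.0):
    """Merge an accent glyph into the preceding glyph when their x positions match.

    glyphs: list of (char, x, y) tuples in file order.
    The y coordinate is ignored on purpose: the accent is drawn higher
    than the base letter, which is what confuses line detection.
    """
    merged = []
    for char, x, y in glyphs:
        combining = SPACING_TO_COMBINING.get(char)
        if combining and merged and abs(merged[-1][1] - x) <= x_tolerance:
            base, base_x, base_y = merged[-1]
            # NFC composes base + combining mark into one character if possible.
            composed = unicodedata.normalize("NFC", base + combining)
            merged[-1] = (composed, base_x, base_y)
        else:
            merged.append((char, x, y))
    return merged


# The dots are drawn 4 units above the "a" they belong to.
word = [("B", 100, 100), ("a", 110, 100), ("\u00a8", 110.2, 104), ("r", 116, 100)]
text = "".join(char for char, _, _ in merge_accents(word))
```

Without a step like this, a strict same-line test would see the dots at y = 104 as belonging to a different line than the rest of the word.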
