Why is the highlighting of text so funky in PDFs? [closed]

2023-04-03 19:59 Q&A author:

Closed. This question is off-topic. It is not currently accepting answers.

Closed 11 years ago.

This has happened to me a million times and I'm finally getting around to figuring it out. Very often, when I'm highlighting a row of text in a PDF, the selection jumps around, skips lines, and skips letters in the middle. For example: "this is" (highlighted) "this isn't" (not highlighted) "this is again" (highlighted), even though all three are on the same line. What gives?

It depends on how the PDF was generated. Some programs generate full lines of text, and these are easy to highlight because the PDF viewer (Acrobat, etc.) knows that the text is linear. Other programs write out each glyph (letter) one by one. Basically "draw letter a at position (100,100), letter b at (110,99.99)". In this example the letter b is 10 units further in the x direction and almost exactly the same in the y direction. Almost. Visually the two letters appear to sit on the same line. Mathematically the viewer has to guess that they are on the same "line" or are "next to each other". Sometimes it gets it right, sometimes it doesn't.
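
To make the guessing concrete, here is a minimal sketch of how a viewer might decide which glyphs share a line. It is not how Acrobat or any particular viewer actually works; the `Glyph` class, the `group_into_lines` function and the `y_tolerance` value are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float
    y: float  # baseline position; larger y is higher on the page


def group_into_lines(glyphs, y_tolerance=0.5):
    """Guess lines by clustering glyphs whose y values are close together.

    y_tolerance is an assumed threshold, not a value from any PDF viewer.
    """
    lines = []
    # Walk from the top of the page down, starting a new line whenever
    # the vertical jump from the previous glyph exceeds the tolerance.
    for glyph in sorted(glyphs, key=lambda g: g.y, reverse=True):
        if lines and abs(lines[-1][-1].y - glyph.y) <= y_tolerance:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])
    # Within each guessed line, read left to right.
    return ["".join(g.char for g in sorted(line, key=lambda g: g.x))
            for line in lines]


glyphs = [Glyph("a", 100, 100), Glyph("b", 110, 99.99), Glyph("c", 100, 88)]
print(group_into_lines(glyphs))                     # a and b share a line
print(group_into_lines(glyphs, y_tolerance=0.001))  # a and b get split
```

With a forgiving tolerance, "a" and "b" land on the same line. With a strict one, the 0.01 difference in y is enough to split them, which is the kind of wrong guess that makes a selection skip letters.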

Why do programs write things out letter by letter? When advanced formatting is used (letter spacing, kerning, ligatures, etc.), some design programs decide to write out how something should look rather than what it actually is. A graphic designer generally doesn't care whether two letters are physically next to each other in the file; they only care whether they look like they are.

Why do some programs mess up the y coordinate when writing letters next to each other? Remember, they're trying to describe how text should look, not what the actual text is. Fonts differ in height, and some design programs might adjust the positioning (just slightly) so that text visually falls in line better.

Lastly, there's no guarantee that text is written linearly within the file, left to right or top to bottom. Some programs might write line 1, then line 3 and then line 2. It looks okay when displayed, but the order in the file is not the order on the page. Why do they do this? Who knows. Maybe line two was indented a bit (or used a letter that caused the position of the ink to be slightly indented), so a left-to-right scan didn't catch it right away.
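
As a rough illustration of the ordering problem, the sketch below takes text runs in the order they appear in the file and sorts them by position instead. The `reading_order` function and the `(text, x, y)` tuple layout are assumptions made for this example, and it assumes y grows upward from the bottom of the page, as in PDF's default coordinate system.

```python
def reading_order(runs):
    """Sort (text, x, y) runs top to bottom, then left to right.

    The runs arrive in file order, which may not match the visual order.
    """
    ordered = sorted(runs, key=lambda run: (-run[2], run[1]))
    return [text for text, _x, _y in ordered]


# File order is line 1, line 3, line 2; line 2 is slightly indented.
runs = [
    ("line 1", 72, 700),
    ("line 3", 72, 672),
    ("line 2", 80, 686),
]
print(reading_order(runs))
```

A plain sort like this only works for a simple single-column page. If glyphs on one line have slightly different y values, it has to be combined with some tolerance, and then the guessing starts again.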

Hopefully that helps a bit!

It has to do with how text is represented in PDF files. The PDF format doesn't really have the concept of lines of text (at least in its basic form); it just puts letters (or rather glyphs) on the page at specific positions.

The application that displays the PDF often has to guess the order in which the text is supposed to be read. This can be harder than it sounds for complex multi-column layouts, pull quotes, etc. and it can even be difficult for "regular" text if there are footnotes or fonts with different metrics on the same line.

Some PDFs also represent umlauts and accented characters as multiple glyphs (e.g. a ¨ on top of an "a") in which case it can be difficult to determine to which line the character belongs.
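
Here is a small sketch of what a text extractor might do with such split characters: attach each accent glyph to the base letter nearest to it horizontally, then compose the pair with Unicode normalization. The `merge_accents` function, the `max_dx` threshold and the accent mapping are assumptions for illustration, and the sketch assumes all glyphs already belong to one line.

```python
import unicodedata

# Assumed mapping from standalone accent glyphs to Unicode combining marks.
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # diaeresis (umlaut dots)
    "\u00b4": "\u0301",  # acute accent
}


def merge_accents(glyphs, max_dx=2.0):
    """Merge accent glyphs into the nearest base letter.

    glyphs is a list of (char, x, y) tuples from a single line.
    max_dx is an assumed limit on horizontal distance.
    """
    bases = sorted((g for g in glyphs if g[0] not in SPACING_TO_COMBINING),
                   key=lambda g: g[1])
    marks = [""] * len(bases)
    for char, x, _y in glyphs:
        if char not in SPACING_TO_COMBINING or not bases:
            continue
        # Pick the base letter whose x position is closest to the accent.
        i = min(range(len(bases)), key=lambda k: abs(bases[k][1] - x))
        if abs(bases[i][1] - x) <= max_dx:
            marks[i] += SPACING_TO_COMBINING[char]
    # NFC composes a base letter plus combining mark into one character.
    return "".join(unicodedata.normalize("NFC", base[0] + mark)
                   for base, mark in zip(bases, marks))


# The umlaut dots sit 6 units above the baseline of the other letters.
glyphs = [("B", 100, 100), ("a", 110, 100), ("\u00a8", 110.3, 106), ("r", 118, 100)]
print(merge_accents(glyphs))
```

Note that the accent has a different y than the letters around it, so a viewer that groups glyphs into lines purely by y could easily put it on the wrong line.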
