Text searching PDF

2023-03-06 12:14 问答作者：

When parsing a PDF, given a string (popped from the Tj or TJ operator callbacks) with the Identity-H encoding how do you map that string to a unicode (say UTF8)开发者_开发百科 representation?

If I need a CMap for this, how do I create (or retrieve) and apply the CMap?

You'll probably have to parse the font data itself. Identity-H just means "use the bytes as raw glyph indexes into the given font". That's why you MUST embed fonts when using Identity-H... different versions of the same font need not have the same glyph order.

There's example code on how to do this sort of thing in several different open source projects. iText, for example (yes, I'm biased).

You'd mentioned a CMap. Identity-H fonts can have a CMap but aren't required to do so. The /ToUnicode entry will be a stream that is a CMap, as defined in some adobe spec somewhere. They aren't all that complex:

/CIDInit /ProcSet findresource begin  
12 dict begin  
begincmap  
/CIDSystemInfo  
<< /Registry (TTX+0)  
/Ordering (T42UV)  
/Supplement 0  
>> def  
/CMapName /TTX+0 def  
/CMapType 2 def
1 begincodespacerange  
<0000><FFFF>  
endcodespacerange  
80 beginbfrange  
<0003><0003><0020>  
<0024><0024><0041>  
<0025><0025><0042>  
<0026><0026><0043>  
<0027><0027><0044>  
<0028><0028><0045>  
<0029><0029><0046>  
<002a><002a><0047>  
<002b><002b><0048>
<002c><002c><0049>
<002d><002d><004a>
<002e><002e><004b>
<002f><002f><004c>
<0030><0030><004d>
<0031><0031><004e>
<0032><0032><004f>
<0033><0033><0050>
<0034><0034><0051>
<0035><0035><0052>
<0036><0036><0053>
<0037><0037><0054>
<0038><0038><0055>
<0039><0039><0056>
<003a><003a><0057>
<003b><003b><0058>
<003c><003c><0059>
<003d><003d><005a>
<0065><0065><00c9>
<00c8><00c8><00c1>
<00cb><00cb><00cd>
<00cf><00cf><00d3>
<00d2><00d2><00da>
<00e2><00e2><0160>
<00e4><00e4><017d>
<00e9><00e9><00dd>
<00fd><00fd><010c>
<0104><0104><0104>
<0106><0106><010e>
<0109><0109><0118>
<010b><010b><011a>
<0115><0115><0147>
<011b><011b><0158>
<0121><0121><0164>
<0123><0123><016e>
<01a0><01a0><0116>
<01b2><01b2><012e>
<01cb><01cb><016a>
<01cf><01cf><0172>
<022c><022c><0401>
<023b><023b><0411>
<023c><023c><0412>
<023d><023d><0413>
<023e><023e><0414>
<023f><023f><0415>
<0240><0240><0416>
<0241><0241><0417>
<0242><0242><0418>
<0243><0243><0419>
<0244><0244><041a>
<0245><0245><041b>
<0246><0246><041c>
<0247><0247><041d>
<0248><0248><041e>
<0249><0249><041f>
<024a><024a><0420>
<024b><024b><0421>
<024c><024c><0422>
<024d><024d><0423>
<024e><024e><0424>
<024f><024f><0425>
<0250><0250><0426>
<0251><0251><0427>
<0252><0252><0428>
<0253><0253><0429>
<0254><0254><042a>
<0255><0255><042b>
<0256><0256><042c>
<0257><0257><042d>
<0258><0258><042e>
<0259><0259><042f>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end

Wow. That particular CMap is horribly inefficient. A "bfrange" starts from parameter 1, and goes to and includes parameter 2, maping values starting at parameter 3 (and continuing on until there are no more things to map.

For example:

<0003><0003><0020>
<0024><0024><0041>
<0025><0025><0042>
<0026><0026><0043>
<0027><0027><0044>
<0028><0028><0045>
<0029><0029><0046>
<002a><002a><0047>
<002b><002b><0048>
<002c><002c><0049>
<002d><002d><004a>
<002e><002e><004b>
<002f><002f><004c>
<0030><0030><004d>
<0031><0031><004e>
<0032><0032><004f>

could be represented as

<0003><0003><0020>
<0024><0032><0041>

A quick google search turned up the CMap/CID font spec.

There are also beginbfchar/endbfchar which just take two parameters (src and dest values, no ranges), CID based versions (at which point you need to have access to Adobe's character ID tables. They're part of Acrobat/Reader installations, though Reader will need to be prodded into downloading the various Language Packs (or kits or whatever they're called)), and various other stuff you really out to read that spec to find out about.

There are multiple ways this data may be encoded (some using CMAPs). You can also have custom encodings (http://www.jpedal.org/PDFblog/2011/04/understanding-the-pdf-file-format-%E2%80%93-custom-font-encodings/). You also need to understand CID fonts (http://www.jpedal.org/PDFblog/2011/03/understanding-the-pdf-file-format-%E2%80%93-what-are-cid-fonts/)

继续阅读：pdf

Text searching PDF

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

Easiest way to get words of one line from istream into a vector?

性激素六项检查的最佳时间是多久？多少钱？？

抽烟只抽炫赫门？

Infinite gtk warnings when I right click on the icon