CGPDFScanner, Identity-H and decompression
My instance of CGPDFScanner is scanning a test PDF file. At a given time, the current font dictionary has an Encoding value of Identity-H and a FontDescriptor dictionary with the key FontFile2. This key happens to be for a stream value, whose dictionary has the key Filter. The value for this key is FlateDecode.
I'm unsure of how to interpret and use this (to, say, convert the text in the next Tj block to Unicode). For example, do I just zlib-decompress the bytes in the next Tj block? (There is no ToUnicode key here.)
I'd thought all the decompression was carried out by the instance of CGPDFScanner.
If the font uses the Identity-H encoding and has no ToUnicode entry, the text cannot be extracted. The operand of the Tj operator is a sequence of glyph indexes, and without a ToUnicode entry this sequence cannot be converted to text.
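The FontFile2 entry stores the actual embedded font file; it plays no role in extracting text from the PDF file. The FlateDecode filter applies only to that font file stream, so the bytes in the Tj block are not zlib-compressed and should not be decompressed.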
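To make the point about the Tj operand concrete, here is a minimal sketch of how its bytes could be read under Identity-H, where every code is two bytes, big-endian. The function name `identity_h_codes` is hypothetical and not part of Core Graphics; in a scanner callback the bytes would presumably come from `CGPDFScannerPopString` followed by `CGPDFStringGetBytePtr` and `CGPDFStringGetLength`. The sketch also assumes the resulting codes can be used directly as glyph indexes, which holds only when the descendant font's CIDToGIDMap is Identity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Core Graphics function): splits the raw bytes
 * of a Tj string operand into 2-byte big-endian codes, as Identity-H
 * prescribes. Writes at most out_cap codes to out and returns the number
 * written. Returns 0 if len is odd, since that is malformed for a 2-byte
 * encoding. */
size_t identity_h_codes(const unsigned char *bytes, size_t len,
                        uint16_t *out, size_t out_cap)
{
    if (len % 2 != 0) {
        return 0;
    }
    size_t n = len / 2;
    if (n > out_cap) {
        n = out_cap;
    }
    for (size_t i = 0; i < n; i++) {
        /* High byte first, then low byte. */
        out[i] = (uint16_t)((bytes[2 * i] << 8) | bytes[2 * i + 1]);
    }
    return n;
}
```

The output is only a list of numbers such as 0x0024. Nothing in the PDF says which Unicode character each one stands for unless a ToUnicode CMap is present, which is why no decompression step can recover the text.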