Using Java PDFBox library to write Russian PDF
I am using a Java library called PDFBox to write text to a PDF. It works perfectly for English text, but when I tried to write Russian text the letters came out garbled. The problem seems to be in the font used, but I am not sure about that, so I hope someone can guide me through this. Here are the important lines of code:
PDTrueTypeFont font = PDTrueTypeFont.loadTTF( pdfFile, new File( "fonts/VREMACCI.TTF" ) ); // Windows Russian font imported to write the Russian text.
font.setEncoding( new WinAnsiEncoding() ); // Define the Encoding used in writing.
// Some code here to open the PDF & define a new page.
contentStream.drawString( "отделом компьютерной" ); // Write the Russian text.
The WinAnsiEncoding source code is : Click here
--------------------- Edit on 18 November 2009
After some investigation, I am now sure it is an encoding problem. It could be solved by defining my own encoding using the helpful PDFBox class DictionaryEncoding.
I am not sure how to use it, but here is what I have tried so far:
COSDictionary cosDic = new COSDictionary();
cosDic.setString( COSName.getPDFName("Ercyrillic"), "0420 " ); // Russian letter.
font.setEncoding( new DictionaryEncoding( cosDic ) );
This does not work. It seems I am filling the dictionary the wrong way: when I write a PDF page using this encoding, the page appears blank.
The DictionaryEncoding source code is : Click here
The long story is this - in order to do Unicode output in PDF from a TrueType font, the output must include a ton of detailed and seemingly superfluous information. What it comes down to is this - inside a TrueType font the glyphs are stored by glyph ID. These glyph IDs are associated with particular Unicode characters (and IIRC, a glyph may internally refer to several code points - like é referring to e and an acute accent - my memory is hazy). PDF doesn't really have Unicode support other than to say that there exists a mapping from UTF16BE values in a string to glyph IDs in a TrueType font, as well as a mapping from UTF16BE values to Unicode - even if it's the identity. Concretely, you need:
- a Font dictionary of Subtype Type0 with
  - a DescendantFonts array with an entry described below
  - a ToUnicode entry that maps UTF16BE values to Unicode
  - an Encoding set to Identity-H
Output from one of my unit tests on my own tools looks like this:
13 0 obj
<<
/BaseFont /DejaVuSansCondensed
/DescendantFonts [ 4 0 R ]
/ToUnicode 14 0 R
/Type /Font
/Subtype /Type0
/Encoding /Identity-H
>> endobj
14 0 obj
<< /Length 346 >> stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def /CMapName /Adobe-Identity-UCS
def /CMapType 2 def 1 begincodespacerange <0000> <FFFF> endcodespacerange 1
beginbfrange <0000> <FFFF> <0000> endbfrange endcmap CMapName currentdict /CMap
defineresource pop end end
endstream % note that the formatting is wrong for the stream
- a Font dictionary of Subtype CIDFontType2 with
  - a CIDSystemInfo
  - a FontDescriptor
  - DW and W
  - a CIDToGIDMap that maps from character ID to glyph ID
Here's the one from the same test - this is the object in the DescendantFonts array:
4 0 obj
<<
/Subtype /CIDFontType2
/Type /Font
/BaseFont /DejaVuSansCondensed
/CIDSystemInfo 8 0 R
/FontDescriptor 9 0 R
/DW 1000
/W 10 0 R
/CIDToGIDMap 11 0 R
>>
8 0 obj
<<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>>
endobj
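To make the Identity-H scheme above a little more concrete, here is a minimal Java sketch. It assumes, as the dumps above suggest, that character IDs are simply UTF-16BE code units. It shows how text could be written as a 2-byte hex string in the content stream and how the bytes of a CIDToGIDMap stream are laid out (two bytes per CID, high byte first). The class name, the method names, and the glyph IDs in the sample map are invented for illustration; real glyph IDs have to come from the font's own cmap table.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of PDFBox or any other library.
class IdentityHSketch {

    // Encodes text as a PDF hex string of 2-byte codes, e.g. <043E0442>.
    // Assumes each character ID equals the UTF-16BE code unit.
    static String toHexString(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder("<");
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    // Builds the raw bytes of a CIDToGIDMap stream: two bytes per CID,
    // big-endian, indexed by CID. CIDs without an entry map to glyph 0.
    static byte[] buildCidToGidMap(Map<Integer, Integer> cidToGid, int maxCid) {
        byte[] map = new byte[(maxCid + 1) * 2];
        for (Map.Entry<Integer, Integer> e : cidToGid.entrySet()) {
            int cid = e.getKey();
            int gid = e.getValue();
            if (cid < 0 || cid > maxCid) {
                continue; // ignore entries outside the table
            }
            map[cid * 2] = (byte) (gid >> 8);
            map[cid * 2 + 1] = (byte) gid;
        }
        return map;
    }

    public static void main(String[] args) {
        // First two letters of the question's sample text, as Unicode escapes.
        String text = "\u043e\u0442";
        System.out.println(toHexString(text));

        Map<Integer, Integer> glyphs = new TreeMap<>();
        glyphs.put(0x043E, 17); // invented glyph IDs
        glyphs.put(0x0442, 23);
        byte[] map = buildCidToGidMap(glyphs, 0x0442);
        System.out.println(map.length + " bytes in the CIDToGIDMap stream");
    }
}
```

This covers only the string and map encoding. The font dictionaries, widths, FontDescriptor and ToUnicode stream listed above still have to be written out around it.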
Why am I telling you this? What does it have to do with PDFBox? Just this: Unicode output in PDF is, frankly, a royal pain in the butt. Acrobat was developed before there was Unicode and it was painful from the start to have CJK encodings without Unicode (I know - I worked on Acrobat then). Later Unicode support was added, but it really felt like it was glommed on. One would hope that you would just say /Encoding /Unicode and have strings that start with the thorn and y-dieresis characters and off you go. No such luck. If you don't put in every detailed thing (and really, Acrobat, embedding a PostScript program to translate to Unicode? WTH?), you get a blank page in Acrobat. I swear, I am not making this up.
At this point, I write PDF generation tools for a separate company (.NET right now, so it won't help you), and I made it a design goal to hide all that nonsense. All text is Unicode - if you only use those character codes that are the same as WinAnsi, that's what you get under the hood. Use anything else, and you get all this other stuff with it. I'd be surprised if PDFBox does that work for you - it is a serious hassle.
Try to use this construction:
PDFont font = PDType0Font.load( pdfFile, new File( "fonts/VREMACCI.TTF" ) ); // Windows Russian font imported to write the Russian text.
// Some code here to open the PDF & define a new page.
contentStream.beginText();
contentStream.setFont(font, 12);
contentStream.showText( "отделом компьютерной" ); // Write the Russian text.
contentStream.endText();
The solution is very simple.
1) Find a font that contains the characters you want to display.
2) Download the font's .ttf file locally.
3) Load the font from your application.
For example, this is what you have to do if you want to use Greek characters:
content = new PDPageContentStream(document, page);
pdfFont = PDType0Font.load( document, new File( "arialuni.ttf" ) );
content.setFont(pdfFont, fontSize);
Perhaps a Russian encoding class needs to be written; it should look like the WinAnsiEncoding one, I suppose.
Now, I have no idea what to put there!
Or, if that's not what you do already, perhaps you should encode your source file in UTF-8 and use a default encoding.
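One way to check the source-file theory is to print the code points that the compiled string really contains. The sketch below is plain Java with no PDFBox involved, and the class and method names are made up for illustration. Russian letters should show up in the U+0400 to U+04FF range. If a literal typed directly into the source prints values such as U+00D0 instead, the compiler most likely read the file with the wrong encoding. Passing `-encoding UTF-8` to javac or using `\uXXXX` escapes are the usual ways around that.

```java
// Hypothetical helper for checking what a string literal really contains.
class SourceEncodingCheck {

    // Lists the code points of a string as "U+XXXX" values separated by spaces.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The first word of the question's sample text, spelled with Unicode
        // escapes so that it does not depend on the source file encoding.
        String escaped = "\u043e\u0442\u0434\u0435\u043b\u043e\u043c";
        System.out.println(describe(escaped));
    }
}
```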
I saw some messages related to issues with extracting Russian text from existing PDF files (using PDFBox of course) but I don't know if output is related.
You can also write to the PDFBox mailing list.
Testing whether this is an encoding issue should be pretty easy to do (just switch to UTF16 encoding).
I'm assuming that you've tried using an editor or something with the VREMACCI font and confirmed that it displays the way you expect it to?
You might want to try doing the same thing in iText just to get a feel for whether the issue is related to the PdfBox library itself... If your primary goal is to generate PDF files, iText might be a better solution anyway.
EDIT - long answer to comments:
ok - sorry for the back and forth on the encoding question... Your core issue (which you probably already knew) is that the encoding of the bytes being written to the content stream is different than the encoding being used to look up glyphs. Now I'll try to actually be helpful:
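The mismatch can be seen without generating a PDF at all. The sketch below relies on the fact that PDF's WinAnsiEncoding is based on Windows code page 1252, and asks the JRE whether the Russian text fits into that code page. The class and method names are invented, and the availability of the windows-1251 and windows-1252 charsets in a given JRE is an assumption worth checking.

```java
import java.nio.charset.Charset;

// Hypothetical helper; the charset names are standard Java names, but
// whether a given JRE ships them is an assumption to verify.
class EncodingMismatchCheck {

    // Returns true if every character of text can be encoded in the charset.
    static boolean fits(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // The first word of the question's sample text, as Unicode escapes.
        String russian = "\u043e\u0442\u0434\u0435\u043b\u043e\u043c";
        System.out.println("windows-1252: " + fits("windows-1252", russian));
        System.out.println("windows-1251: " + fits("windows-1251", russian));
    }
}
```

If the text does not fit the single-byte encoding declared for the font, no choice of font file fixes the output. Either the encoding has to change or the font has to be embedded as a composite (Type0) font.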
I took a look at the DictionaryEncoding class in PDFBox, and it looks quite unintuitive... The 'dictionary' in question is a PDF dictionary. So what you'll basically need to do is create a PDF dictionary object (I think that PDFBox calls this a type of COSObject), then add entries to it.
The encoding for a font is defined in PDF as a dictionary (see page 266 of the above spec). The dictionary contains a base encoding name, plus an optional differences array. Technically, the differences array should not be used with TrueType fonts (although I've seen it used in some cases - don't use it, though).
You will then specify an entry for the cmap for the encoding. This cmap will be the encoding of your font.
My suggestion here is to take an existing PDF that does what you want, then get a dump of the dictionary structure for the font so you can see what it looks like.
This is definitely not for the faint of heart. I can provide some help - if you need a dictionary dump, shoot me a hyperlink with a sample PDF and I'll run it through some of the algorithms I use in my iText development (I'm the maintainer of the iText text extraction sub-system).
EDIT - 11/17/09
OK - here's the dictionary dump from the russian.pdf file (sub-dictionaries are listed indented, and in the order they appeared in the containing dictionary):
(/CropBox=[0, 0, 595, 842], /Parent=Dictionary of type: /Pages, /Type=/Page, /Contents=[209 0 R, 210 0 R, 211 0 R, 214 0 R, 215 0 R, 216 0 R, 222 0 R, 223 0 R], /Resources=Dictionary, /MediaBox=[0, 0, 595, 842], /StructParents=0, /Rotate=0)
Subdictionary /Parent = (/Type=/Pages, /Count=6, /Kids=[195 0 R, 1 0 R, 3 0 R, 5 0 R, 7 0 R, 9 0 R])
Subdictionary /Resources = (/ExtGState=Dictionary, /ProcSet=[/PDF, /Text], /ColorSpace=Dictionary, /Font=Dictionary, /Properties=Dictionary)
Subdictionary /ExtGState = (/GS0=Dictionary of type: /ExtGState)
Subdictionary /GS0 = (/OPM=1, /op=false, /Type=/ExtGState, /SA=false, /OP=false, /SM=0.02)
Subdictionary /ColorSpace = (/CS0=[/ICCBased, 228 0 R])
Subdictionary /Font = (/C2_1=Dictionary of type: /Font, /C2_2=Dictionary of type: /Font, /C2_3=Dictionary of type: /Font, /C2_4=Dictionary of type: /Font, /TT2=Dictionary of type: /Font, /TT1=Dictionary of type: /Font, /TT0=Dictionary of type: /Font, /C2_0=Dictionary of type: /Font, /TT3=Dictionary of type: /Font)
Subdictionary /C2_1 = (/DescendantFonts=[243 0 R], /BaseFont=/LDMIEC+TimesNewRomanPS-BoldMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
Subdictionary /C2_2 = (/DescendantFonts=[233 0 R], /BaseFont=/LDMIBO+TimesNewRomanPSMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
Subdictionary /C2_3 = (/DescendantFonts=[224 0 R], /BaseFont=/LDMIHD+TimesNewRomanPS-ItalicMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
Subdictionary /C2_4 = (/DescendantFonts=[229 0 R], /BaseFont=/LDMIDA+Tahoma, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
Subdictionary /TT2 = (/LastChar=58, /BaseFont=/LDMIFC+TimesNewRomanPS-BoldMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 333], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=136, /Descent=-216, /FontWeight=700, /FontBBox=[-558, -307, 2000, 1026], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=34, /XHeight=0, /FontFamily=Times New Roman, /FontName=/LDMIFC+TimesNewRomanPS-BoldMT, /Ascent=891, /ItalicAngle=0)
Subdictionary /TT1 = (/LastChar=187, /BaseFont=/LDMICP+TimesNewRomanPSMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 833, 778, 0, 333, 333, 0, 0, 250, 333, 250, 278, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 278, 278, 0, 564, 0, 444, 0, 722, 667, 667, 722, 611, 556, 0, 722, 333, 389, 0, 611, 889, 722, 722, 556, 0, 667, 556, 611, 0, 722, 944, 0, 722, 0, 333, 0, 333, 0, 500, 0, 444, 500, 444, 500, 444, 333, 500, 500, 278, 0, 500, 278, 778, 500, 500, 500, 0, 333, 389, 278, 500, 500, 722, 0, 500, 444, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 500], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=82, /Descent=-216, /FontWeight=400, /FontBBox=[-568, -307, 2000, 1007], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=34, /XHeight=0, /FontFamily=Times New Roman, /FontName=/LDMICP+TimesNewRomanPSMT, /Ascent=891, /ItalicAngle=0)
Subdictionary /TT0 = (/LastChar=55, /BaseFont=/LDMIBN+TimesNewRomanPS-BoldItalicMT, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[250, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 250, 0, 500, 500, 500, 0, 0, 0, 0, 500], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=116.867004, /Descent=-216, /FontWeight=700, /FontBBox=[-547, -307, 1206, 1032], /CapHeight=656, /FontFile2=Stream, /FontStretch=/Normal, /Flags=98, /XHeight=468, /FontFamily=Times New Roman, /FontName=/LDMIBN+TimesNewRomanPS-BoldItalicMT, /Ascent=891, /ItalicAngle=-15)
Subdictionary /C2_0 = (/DescendantFonts=[238 0 R], /BaseFont=/LDMHPN+TimesNewRomanPS-BoldItalicMT, /Type=/Font, /Subtype=/Type0, /Encoding=/Identity-H, /ToUnicode=Stream)
Subdictionary /TT3 = (/LastChar=169, /BaseFont=/LDMIEB+Tahoma, /Type=/Font, /Subtype=/TrueType, /Encoding=/WinAnsiEncoding, /Widths=[313, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 546, 0, 546, 0, 0, 546, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 929], /FontDescriptor=Dictionary of type: /FontDescriptor, /FirstChar=32)
Subdictionary /FontDescriptor = (/Type=/FontDescriptor, /StemV=92, /Descent=-206, /FontWeight=400, /FontBBox=[-600, -208, 1338, 1034], /CapHeight=734, /FontFile2=Stream, /FontStretch=/Normal, /Flags=32, /XHeight=546, /FontFamily=Tahoma, /FontName=/LDMIEB+Tahoma, /Ascent=1000, /ItalicAngle=0)
Subdictionary /Properties = (/MC0=Dictionary of type: /OCMD)
Subdictionary /MC0 = (/Type=/OCMD, /OCGs=Dictionary of type: /OCG)
Subdictionary /OCGs = (/Usage=Dictionary, /Type=/OCG, /Name=HeaderFooter)
Subdictionary /Usage = (/CreatorInfo=Dictionary, /PageElement=Dictionary)
Subdictionary /CreatorInfo = (/Creator=Acrobat PDFMaker 6.0 äëÿ Word)
Subdictionary /PageElement = (/SubType=/HF)
There are a lot of moving parts here. You might want to put together a test document that has only 3 or 4 characters in the font in question... There are a lot of Type0 (composite) fonts being used here in addition to the TrueType fonts, so it's hard to tell what is involved in your particular issue.
(Are you sure you don't want to at least try this with iText? ;-) I'm not saying that it'll work, just that it might be worth a shot ).
For reference, the above dictionary dump was obtained using the com.lowagie.text.pdf.parser.PdfContentReaderTool class
Just try this one:
Phrase leftTitle = new Phrase("САНКТ-ПЕТЕРБУРГ", FontFactory.getFont("Tahoma", "Cp1251", true, 25));
This works at least with the latest iText (5.0.1).