pdfmark for docinfo metadata in pdf is not accepting accented characters in Keywords or Subject
I am inserting metadata into postscript files with a program, to be distilled to pdf with Adobe Distiller. I am using this code that I grabbed from an online chapter of Thomas Merz's "Web Publishing with Acrobat-PDF":
/pdfmark where {pop} {userdict /pdfmark /cleartomark load put} ifelse
[ /Title (mot accenté)
/Author (mot accenté)
/Subject (mot accenté)
/Keywords (mot accenté)
/DOCINFO pdfmark
When you look at the metadata in the resulting PDF, the accented characters turn into "?" in the Subject and Keywords fields, but not in the Title and Author fields. The character is the same byte in every field: 233 (é in Latin-1).
I tried replacing it with the octal escape (\351), which came out the same (Title and Author okay, Subject and Keywords messed up).
The file encoding is Latin-1 with Unix line endings.
I found a mention on adobe forums, but the answer didn't make sense to me.
http://forums.adobe.com/message/1165593 and http://forums.adobe.com/thread/307687
I changed the encoding to UTF-8 and inserted the characters directly (in Vim: <Ctrl-v> u00e9), with no change. I tried inserting the BOM in a few places, but that didn't work either.
This is with the Distiller from Acrobat Pro 9 (9.3.3177)
I didn't notice this problem with Acrobat Pro 7.
Does anybody know of a workaround to get the accented characters into ALL the metadata fields when modifying a postscript file, or tell me if I'm doing it wrong?
It seems weird that different fields would not accept the same bytes.
Possibly related SO question: Unicode in PDF
I am embedding all fonts.
Can you try using UTF-16BE for the encoding and starting the strings with the bytes 254 and 255 (thorn and y-dieresis)?
Your last reference (Unicode in PDF) contains a good hint, in the feedback from Mark Storer: write the string as hex characters.
So instead of
[ /Title (mot accenté)
you could try
[ /Title <FEFF006D006F007400200061006300630065006E007400E9>
etc ...
It might be a little bit clumsy, but with a little help from shell scripts it let me add other special characters such as 'ä', 'õ', 'ü' to PDF bookmarks.
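If you would rather script the conversion, a minimal Python sketch could look like this. The function name pdfmark_hex_string is made up for illustration. It assumes the target is a PDF text string, so it encodes the text as UTF-16BE, prepends the FE FF byte order mark, and writes the bytes as a PostScript hex string.

```python
def pdfmark_hex_string(text: str) -> str:
    """Return text as a PostScript hex string: BOM (FE FF) + UTF-16BE bytes."""
    # FE FF is the big-endian byte order mark that flags the string as Unicode.
    data = b"\xfe\xff" + text.encode("utf-16-be")
    # bytes.hex() gives lowercase digits; uppercase matches the example above.
    return "<" + data.hex().upper() + ">"


if __name__ == "__main__":
    # \u00e9 is é; written as an escape so the script does not depend on
    # the encoding of its own source file.
    print("[ /Title " + pdfmark_hex_string("mot accent\u00e9"))
```

The printed value can be pasted in place of the parenthesised string for any of the DOCINFO keys.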
So, you're supposed to be able to use an ANSI encoded file and any characters which are in the PDFDocEncoding set (which the French accented characters are), but that doesn't work.
Another method is to still use a Latin-1 encoded file, but write each character as a two-byte Unicode value in octal form (\xxx\xxx), and start the string with the BOM: \377\376
So the above subject string "mot accenté" has to be translated to:
/Subject (\377\376\155\000\157\000\164\000\040\000\141\000\143\000\143\000\145\000\156\000\164\000\351\000)
This works, but it sucks. Anyone have anything better?
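Typing those escapes by hand is error-prone, so here is a hedged Python sketch that generates them. The function name pdfmark_octal_utf16 and its little_endian parameter are assumptions for illustration. By default it emits the big-endian form (\376\377 first), which is the byte order the PDF Reference describes for text strings; passing little_endian=True reproduces the string shown above byte for byte.

```python
def pdfmark_octal_utf16(text: str, little_endian: bool = False) -> str:
    """Return text as a PostScript string made only of \\NNN octal escapes."""
    if little_endian:
        # BOM FF FE, then low byte first (the layout used in the answer above).
        data = b"\xff\xfe" + text.encode("utf-16-le")
    else:
        # BOM FE FF, then high byte first (the layout the PDF Reference describes).
        data = b"\xfe\xff" + text.encode("utf-16-be")
    # Every byte becomes a backslash plus exactly three octal digits.
    return "(" + "".join("\\%03o" % byte for byte in data) + ")"


if __name__ == "__main__":
    subject = "mot accent\u00e9"  # \u00e9 is é
    print("/Subject " + pdfmark_octal_utf16(subject, little_endian=True))
    print("/Subject " + pdfmark_octal_utf16(subject))
```

Because the output is pure ASCII, the PostScript file itself can stay Latin-1 (or anything else) without affecting the result.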
You do not need to escape/encode ALL the characters!
It is enough to keep the standard ASCII characters as they are and just mix in the \NNN octal notation where a special character should appear.
The following Ghostscript command creates a two-page PDF. It will have nearly empty pages, with 2 bookmarks/outlines included, plus the metadata with accents. The example is for Windows; on Unix/Linux just use gs and change the line-end escapes from DOS batch's ^ to the Unix shell's \:
gswin32c.exe ^
-sDEVICE=pdfwrite ^
-o 2-empty-pages-with-bookmarks-and-accents-in-metadata.pdf ^
-c "[/Creator(brains&smarts)/Author(pipitas)/Subject(m\350t accent\351)/Title(mot accent\352)/Keywords(ganz sch\353\353 bl\353\353\d!)/DOCINFO pdfmark" ^
-c "[/Page 1 /View [/XYZ null null null] /Title (Page One) /OUT pdfmark" ^
-c "[/Page 2 /View [/XYZ null null null] /Title (Page Two) /OUT pdfmark" ^
-c "200 500 moveto /Helvetica findfont 100 scalefont setfont (One) show showpage 200 500 moveto (Two) show showpage quit"
I hope this finally settles your question "Does anybody know of a workaround to get the accented characters into ALL the metadata fields when modifying a postscript file?".
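For a program that inserts the metadata automatically, the mixing of plain ASCII with \NNN escapes used in the Ghostscript command above can be sketched in Python as follows. The function name pdfmark_latin1_string is hypothetical, and the sketch assumes the text fits in Latin-1, whose code for é (233) is the one used in the command above. It also escapes parentheses and backslashes, which would otherwise break the PostScript string.

```python
def pdfmark_latin1_string(text: str) -> str:
    """Return text as a PostScript string: ASCII kept, other bytes as \\NNN."""
    out = []
    # encode() raises UnicodeEncodeError if a character is outside Latin-1.
    for byte in text.encode("latin-1"):
        ch = chr(byte)
        if ch in "()\\":
            # These three characters are special inside a PostScript string.
            out.append("\\" + ch)
        elif 32 <= byte < 127:
            # Printable ASCII is passed through unchanged.
            out.append(ch)
        else:
            # Everything else becomes a three-digit octal escape.
            out.append("\\%03o" % byte)
    return "(" + "".join(out) + ")"


if __name__ == "__main__":
    print("/Subject " + pdfmark_latin1_string("mot accent\u00e9"))
```

Whether a given distiller accepts these single-byte escapes in every field is exactly what the question is about, so treat this as a convenience for the Ghostscript route described above.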
Although this does not directly answer your question, Google led me here when searching for "pdf metadata accented".
So, it may be useful for others to know that you can change a PDF's metadata using pdftk.
To include accented characters, use HTML character codes.
It took me a while to figure out why "Baçan" was shown as "Baħan": PDF metadata does not accept UTF-8.
Example of metadata for Júlio Verne:
InfoKey: Author
InfoValue: J&#250;lio Verne
Also, I could use hexedit and manually insert the HEX code into the correct position.
é = HEX E9 HTML: &#233;
ç = HEX E7 HTML: &#231;
ú = HEX FA HTML: &#250;
ó = HEX F3 HTML: &#243;
and so on. Take a look at the table above.
I hope this serves to help someone.
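If you have many values to convert, a small Python sketch can produce the numeric character codes for you. The names pdftk_info_value and pdftk_info_record are made up for illustration, and the sketch assumes pdftk reads numeric entities of the form &#NNN; in its info file.

```python
def pdftk_info_value(text: str) -> str:
    """Replace every non-ASCII character with a numeric entity such as &#233;."""
    # The xmlcharrefreplace error handler emits &#NNN; for anything not ASCII.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def pdftk_info_record(key: str, value: str) -> str:
    """Build one InfoKey/InfoValue pair in the layout shown above."""
    return "InfoKey: {}\nInfoValue: {}".format(key, pdftk_info_value(value))


if __name__ == "__main__":
    # \u00fa is ú; written as an escape so the source file encoding is irrelevant.
    print(pdftk_info_record("Author", "J\u00falio Verne"))
```

Save the output to a text file and feed it to pdftk's update_info operation. The exact record layout can differ between pdftk versions, so it is worth comparing against what dump_data prints for your own file.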