pdfmark for docinfo metadata in pdf is not accepting accented characters in Keywords or Subject

2023-01-03 09:44 Asked by:

I am inserting metadata into PostScript files with a program; the files are then distilled to PDF with Adobe Distiller. I am using this code, which I took from an online chapter of Thomas Merz's "Web Publishing with Acrobat-PDF":

/pdfmark where {pop} {userdict /pdfmark /cleartomark load put} ifelse
[ /Title (mot accenté)
  /Author (mot accenté)
  /Subject (mot accenté)
  /Keywords (mot accenté)
/DOCINFO pdfmark

When you look at the metadata in the resulting PDF, the accented characters turn into "?" in the Subject and Keywords fields, but not in the Title and Author fields. The character is the same byte in every field: 233 (é in Latin-1, which is outside ASCII).

I tried replacing them with octal encoding (\351), which came out the same (Title and Author okay, Subject and Keywords messed up).

The file encoding is Latin-1 with Unix line endings.

I found a mention of this on the Adobe forums, but the answer didn't make sense to me:

http://forums.adobe.com/message/1165593 and http://forums.adobe.com/thread/307687

I changed the file encoding to UTF-8 and inserted the characters directly (in Vim: <Ctrl-v>u00e9), with no change. I also tried inserting the BOM in a few places, which didn't work either.

This is with the Distiller from Acrobat Pro 9 (9.3.3177)

I didn't notice this problem with Acrobat Pro 7.

Does anybody know of a workaround to get the accented characters into ALL the metadata fields when modifying a PostScript file, or can you tell me if I'm doing it wrong?

It seems weird that different fields would not accept the same bytes.

Possibly related SO question: Unicode in PDF

I am embedding all fonts.

Can you try using UTF-16BE for the encoding and starting the strings with bytes 254 and 255 (thorn and y-dieresis)?

Your last reference, "Unicode in PDF", contains a good hint: write the string as hex-encoded Unicode (see the feedback from Mark Storer).

So instead of

[ /Title (mot accenté)

you could try

[ /Title <FEFF006D006F007400200061006300630065006E007400E9>

etc ...

It might be a little clumsy, but with a little help from shell scripts it let me add other special characters like 'ä', 'õ', 'ü' to PDF bookmarks.
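The hex string above does not have to be typed by hand. Here is a minimal Python sketch of the conversion; the function name pdfmark_hex_string is made up for illustration, and it assumes the string should be UTF-16BE with a leading FEFF byte order mark, as in the example above:

```python
def pdfmark_hex_string(text: str) -> str:
    """Return text as a PostScript hex string: BOM (FE FF) + UTF-16BE bytes."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    # PostScript hex strings are written between angle brackets.
    return "<" + data.hex().upper() + ">"


if __name__ == "__main__":
    # Prints a line that can be pasted into a DOCINFO pdfmark.
    print("[ /Title " + pdfmark_hex_string("mot accent\u00e9"))
```

The same function could be called for /Author, /Subject and /Keywords, so every field is encoded the same way.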

So, you're supposed to be able to use an ANSI encoded file and any characters which are in the PDFDocEncoding set (which the French accented characters are), but that doesn't work.

Another method is to keep the Latin-1 encoded file, but write the characters as Unicode in octal form (2 bytes per character: \xxx\xxx) and start the string with the BOM: \377\376

So the above subject string "mot accenté" has to be translated to:

/Subject (\377\376\155\000\157\000\164\000\040\000\141\000\143\000\143\000\145\000\156\000\164\000\351\000)

This works, but it sucks. Anyone have anything better?
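Writing those octal escapes by hand is error-prone, so here is a rough Python sketch that generates them. The function name and the byteorder parameter are my own invention. With byteorder="little" it reproduces the string above (BOM \377\376, low byte first). As far as I know the PDF Reference describes Unicode text strings as big-endian with the BOM \376\377, so a "big" option is included as well; which one your Distiller version accepts is something you would have to test.

```python
def pdfmark_octal_utf16(text: str, byteorder: str = "little") -> str:
    """Return text as a PostScript string of octal-escaped UTF-16 bytes with BOM."""
    if byteorder == "little":
        data = b"\xff\xfe" + text.encode("utf-16-le")  # BOM \377\376
    elif byteorder == "big":
        data = b"\xfe\xff" + text.encode("utf-16-be")  # BOM \376\377
    else:
        raise ValueError("byteorder must be 'little' or 'big'")
    # Every byte becomes a three-digit octal escape, e.g. 233 -> \351.
    return "(" + "".join("\\%03o" % b for b in data) + ")"


if __name__ == "__main__":
    print("/Subject " + pdfmark_octal_utf16("mot accent\u00e9"))
```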

You do not need to escape/encode ALL the accented characters!

It is enough to keep the standard ASCII characters and just mix in the \NNN notation where a special character should appear.
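If the strings come from a program, the mixing can be automated. This is only a sketch: the function name pdfmark_escape is hypothetical, and it assumes every character fits in Latin-1 (which matches PDFDocEncoding for é and the other characters discussed here, but not for every code point). It also escapes parentheses and backslashes, which have special meaning inside a PostScript string.

```python
def pdfmark_escape(text: str) -> str:
    """Keep printable ASCII as-is and write every other byte as \\NNN octal."""
    out = []
    # Assumption: Latin-1 is enough; other characters raise UnicodeEncodeError.
    for byte in text.encode("latin-1"):
        ch = chr(byte)
        if ch in "()\\":
            out.append("\\" + ch)         # special inside PostScript strings
        elif 32 <= byte < 127:
            out.append(ch)                # plain printable ASCII stays readable
        else:
            out.append("\\%03o" % byte)   # e.g. 233 (é) -> \351
    return "(" + "".join(out) + ")"


if __name__ == "__main__":
    print("/Subject " + pdfmark_escape("mot accent\u00e9"))
```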

The following Ghostscript command creates a two-page PDF. The pages are nearly empty, with 2 bookmarks/outlines included, plus the metadata with accents. The example is for Windows; on Unix/Linux just use gs and change the line-continuation character from the DOS batch ^ to the Unix shell \:

gswin32c.exe ^
 -sDEVICE=pdfwrite ^
 -o 2-empty-pages-with-bookmarks-and-accents-in-metadata.pdf ^
 -c "[/Creator(brains&smarts)/Author(pipitas)/Subject(m\350t accent\351)/Title(mot accent\352)/Keywords(ganz sch\353\353 bl\353\353\d!)/DOCINFO pdfmark" ^
 -c "[/Page 1 /View [/XYZ null null null] /Title (Page One) /OUT pdfmark" ^
 -c "[/Page 2 /View [/XYZ null null null] /Title (Page Two) /OUT pdfmark" ^
 -c "200 500 moveto /Helvetica findfont 100 scalefont setfont (One) show showpage 200 500 moveto (Two) show showpage quit"

I hope this finally settles your question "Does anybody know of a workaround to get the accented characters into ALL the metadata fields when modifying a postscript file?".

Although this does not directly answer your question, Google led me here when I searched for "pdf metadata accented".

So it may be useful for others to know that you can change a PDF's metadata using pdftk.

To include accented characters, write them as HTML numeric character codes.

It took me a while to figure out why "Baçan" was shown as "BaÄ§an"; it is because the PDF metadata does not accept UTF-8.

Example of metadata for Júlio Verne:

InfoKey: Author
InfoValue: J&#250;lio Verne

Also, I could use hexedit and manually insert the HEX code into the correct position.

é = HEX E9 HTML: &#233;
ç = HEX E7 HTML: &#231;
ú = HEX FA HTML: &#250;
ó = HEX F3 HTML: &#243;

and so on. Take a look at the table above.
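For longer metadata it may be easier to generate the codes than to look them up. This is a small Python sketch; the function name pdftk_info_value is hypothetical, and it only converts non-ASCII characters (a literal "&" in the text is not treated specially here).

```python
def pdftk_info_value(text: str) -> str:
    """Replace each non-ASCII character with its decimal code, e.g. ú -> &#250;."""
    return "".join(ch if ord(ch) < 128 else "&#%d;" % ord(ch) for ch in text)


if __name__ == "__main__":
    # Prints an InfoKey/InfoValue pair in the format shown above.
    print("InfoKey: Author")
    print("InfoValue: " + pdftk_info_value("J\u00falio Verne"))
```

The output can be saved to a text file and applied with pdftk's update_info operation, for example: pdftk in.pdf update_info info.txt output out.pdf (the file names are placeholders).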

I hope this serves to help someone.
