Why can't I use Ñ in my XML output, when declared as UTF-8?

I have the 'N Tilde' character Ñ in my z/OS DB2 database. I am generating an XML file from the data. In the XML I have encoding=UTF-8, however Internet Explorer gives me the error Illegal character in text field. If I change the encoding to ISO-8859-1 it works fine.

I thought ISO-8859-1 was a subset of UTF-8, so why is it not working with UTF-8?

Is UTF-8 the best for an XML document?


ISO-8859-1 is not a subset of UTF-8. It can represent a subset of the characters representable in UTF-8, but it doesn't do so in the same way.

Both ISO-8859-1 and UTF-8 are supersets of ASCII (i.e. they can represent all characters that ASCII can represent and they represent them in the same way).

So you can't just label ISO-8859-1 data as UTF-8 and hope that it works; you need to actually store (or convert) your data as UTF-8.
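As a rough illustration of the difference between relabeling and converting, here is a minimal sketch. The class and method names are made up for this example. It checks whether a byte sequence is well-formed UTF-8, and converts ISO-8859-1 bytes by decoding them first and then re-encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
public class Latin1ToUtf8 {

    // Decode the bytes as ISO-8859-1, then re-encode the resulting text as UTF-8.
    static byte[] latin1ToUtf8(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Strict check: true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xD1 }; // 'Ñ' as stored in ISO-8859-1

        // Merely labeling these bytes as UTF-8 does not make them UTF-8.
        System.out.println(isValidUtf8(latin1));               // false
        // Converting them does.
        System.out.println(isValidUtf8(latin1ToUtf8(latin1))); // true
    }
}
```

A lone D1 byte is the start of a two-byte UTF-8 sequence with nothing after it, which is the kind of input a strict parser rejects as an illegal character.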


UTF-8 ≠ Unicode

Note carefully:

  • ASCII is a subset of ISO 8859-1.
  • ASCII is a subset of Unicode.
  • ASCII is a subset of UTF-8.
  • ISO 8859-1 is a subset of Unicode.
  • ISO 8859-1 is not a subset of UTF-8.
  • Unicode is not the same thing as UTF-8.

I strongly advise familiarizing oneself with the subtleties in modern terminology.

If that’s too confusing, you might look at Radix-50, which has a repertoire many orders of magnitude smaller than Unicode’s, but which nevertheless manifests several of the same subtleties that now escape people with respect to Unicode, character repertoires, coded character sets, character encoding forms, and character encoding schemes.

Java chars Incapable of Holding Characters

Since you’re coming at this from Java, it really isn’t your fault that these aren’t clearly separate concepts in your mind. That’s because Java gravely confuses these issues by not separating out the abstract code points (the logical characters) of a coded character set from the down-and-dirty mechanics of one particular character encoding form.

Java’s miserable conflation of chars with logical characters is error-prone in the extreme; perhaps it would be more accurate to say that Java programmers’ conflation of the same is miserable. In any event, there now seems to be no hope of remedy, ever.

Blame it all on the hysterical porpoises if you must, but the most charitable thing you can say about it is that it is highly unfortunate. Because of all this, well-meaning and perfectly competent programmers like yourself will forever be easily confused, and so will continually write Java code that is simple, clear, and wrong.

Education about all this is the only possible palliative, but it is no true cure.
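To make the char versus code point distinction concrete, here is a small sketch (the class and method names are hypothetical). A Java char is a UTF-16 code unit, so a code point outside the Basic Multilingual Plane occupies two chars. Ñ itself fits in a single char, so this particular trap is not what breaks the XML in the question, but it is the conflation described above.

```java
// Hypothetical demo class, for illustration only.
public class CharVsCodePoint {

    // Number of UTF-16 code units (Java chars) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points (logical characters) in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String nTilde = "\u00D1";                                 // U+00D1, inside the BMP
        String grinning = new String(Character.toChars(0x1F600)); // U+1F600, outside the BMP

        System.out.println(charCount(nTilde) + " " + codePointCount(nTilde));     // 1 1
        System.out.println(charCount(grinning) + " " + codePointCount(grinning)); // 2 1
    }
}
```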


ISO-8859-1 is not at all a subset of UTF-8. ASCII is a subset of both ISO-8859-1 and UTF-8. They specifically differ for characters in the Unicode code point range of U+0080 - U+00FF.

In ISO-8859-1, the character 'Ñ' (U+00D1 LATIN CAPITAL LETTER N WITH TILDE) is represented as the single byte D1. In UTF-8, the same character is represented by the two byte sequence C3 91.
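You can check those byte values yourself with a few lines of Java. This is only a sketch, and the class and helper names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
public class NTildeBytes {

    // Encode the text in the given charset and format the bytes as uppercase hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nTilde = "\u00D1"; // LATIN CAPITAL LETTER N WITH TILDE

        System.out.println(hex(nTilde, StandardCharsets.ISO_8859_1)); // D1
        System.out.println(hex(nTilde, StandardCharsets.UTF_8));      // C3 91
    }
}
```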


For generating XML in Java, the best thing to do would be to use an XML library; this also ensures that everything is well-formed.
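One possible way to do that is with the StAX API (javax.xml.stream) that ships with the JDK. The sketch below assumes a single element called name, which is invented here and not taken from the question. The writer handles both the encoding and the escaping of the text.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class, for illustration only.
public class StaxXmlExample {

    // Build a tiny XML document as UTF-8 bytes; "name" is a made-up element.
    static byte[] buildXml(String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            XMLStreamWriter writer =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            // Declare the same encoding that the writer uses for the bytes.
            writer.writeStartDocument("UTF-8", "1.0");
            writer.writeStartElement("name");
            writer.writeCharacters(value); // escapes <, > and & as needed
            writer.writeEndElement();
            writer.writeEndDocument();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] xml = buildXml("ESPA\u00D1A");
        System.out.println(new String(xml, StandardCharsets.UTF_8));
    }
}
```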

If you must create it by hand, best use new OutputStreamWriter(stream, encoding), where encoding is the same encoding that you declare in your XML declaration.
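A minimal sketch of the by-hand approach might look like this. The class, method, and element names are hypothetical. The point is that one Charset value drives both the declaration and the actual byte encoding, so the two cannot drift apart.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
public class HandWrittenXml {

    // Write a tiny document; the declared encoding and the writer's encoding
    // both come from the same Charset argument.
    // Note: value is NOT escaped here, which a real XML library would do for you.
    static void writeXml(OutputStream stream, Charset encoding, String value)
            throws IOException {
        Writer writer = new OutputStreamWriter(stream, encoding);
        writer.write("<?xml version=\"1.0\" encoding=\"" + encoding.name() + "\"?>\n");
        writer.write("<name>" + value + "</name>\n"); // "name" is a made-up element
        writer.flush();
    }

    // Convenience wrapper that returns the encoded document as bytes.
    static byte[] toBytes(Charset encoding, String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            writeXml(out, encoding, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] utf8 = toBytes(StandardCharsets.UTF_8, "\u00D1");
        byte[] latin1 = toBytes(StandardCharsets.ISO_8859_1, "\u00D1");
        // The UTF-8 document is one byte longer: Ñ takes two bytes instead of one,
        // while the declaration "UTF-8" is five characters shorter than "ISO-8859-1".
        System.out.println(utf8.length + " " + latin1.length);
    }
}
```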

Also make sure that the Strings you get from your database are encoded the right way.
