Avoid printing unicode replacement character in Java

In Java, why does Character.toString((char) 65533) print out this symbol: � ?

I have a Java program which prints these characters all over the place. It's a big program. Any ideas on what I can do to avoid this?


One of the most likely scenarios is that you are reading ISO-8859 data using the UTF-8 character set. When the decoder comes across a sequence of bytes that is not valid UTF-8, it replaces that sequence with the � symbol.

Check your input streams, and ensure that you read them using the correct character set.
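To illustrate the point about input streams, here is a minimal sketch. The class and method names (CharsetMismatchDemo, readAll) are made up for this example, and the byte array stands in for whatever file or socket the real program reads from. It reads the same ISO-8859-1 bytes twice, once with the wrong charset and once with the right one.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    // Reads every character from the byte source using an explicitly chosen charset.
    // The ByteArrayInputStream is a stand-in for a real file or network stream.
    static String readAll(byte[] data, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "café" in ISO-8859-1: the é is the single byte 0xE9, which is not valid UTF-8 on its own
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9};

        String wrong = readAll(latin1, StandardCharsets.UTF_8);
        String right = readAll(latin1, StandardCharsets.ISO_8859_1);

        System.out.println((int) wrong.charAt(3)); // 65533, the replacement character
        System.out.println((int) right.charAt(3)); // 233, the code point of é
    }
}
```

The same idea applies to FileReader, Scanner and similar classes: pass the charset explicitly instead of relying on the platform default.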


In Java, why does Character.toString((char) 65533) print out this symbol: � ?

Because exactly this character IS the one assigned to that code point. Java is not displaying a random character, as you seem to think.
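A quick way to check this yourself is sketched below. The class and method names (ReplacementCharDemo, describe) are only for illustration; Character.getName is the standard JDK call that returns a code point's Unicode name.

```java
public class ReplacementCharDemo {
    // Formats a char as "U+XXXX NAME" using the Unicode name known to the JDK.
    static String describe(char c) {
        return String.format("U+%04X %s", (int) c, Character.getName(c));
    }

    public static void main(String[] args) {
        // 65533 decimal is 0xFFFD hexadecimal
        System.out.println(describe((char) 65533)); // U+FFFD REPLACEMENT CHARACTER
        System.out.println((char) 65533 == '\uFFFD'); // true
    }
}
```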

I have a Java program which prints these characters all over the place. It's a big program. Any ideas on what I can do to avoid this?

Your problem lies somewhere else. At the very least, every step that involves a byte-to-char conversion (storing text in a file or database, reading text from a file or database, manipulating text, transferring text, displaying text, and so on) should be set to use UTF-8.

What catches my eye is that, when Java writes text out, it replaces characters the target encoding cannot represent with a plain question mark ?, not with 0xFFFD, and yet you keep insisting that the 0xFFFD comes from Java. I know that Firefox does exactly what you said, so are you maybe confusing "Firefox" with "Java"?
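For reference, the two directions of conversion behave differently in the JDK, and a small sketch makes that easier to see. The class and method names here are invented for the example. Encoding a character the charset cannot represent yields a question mark, while decoding bytes that are invalid in the charset yields U+FFFD.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodeVersusDecodeDemo {
    // Encoding direction: unmappable characters become the charset's
    // replacement byte, which is '?' for US-ASCII.
    static String encodeThenDecode(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // Decoding direction: bytes that are invalid in the charset become U+FFFD.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        System.out.println(encodeThenDecode("caf\u00e9", StandardCharsets.US_ASCII)); // caf?

        String decoded = decode(new byte[] {(byte) 0xE9}, StandardCharsets.US_ASCII);
        System.out.println(decoded.charAt(0) == '\uFFFD'); // true
    }
}
```

So a literal ? in the output usually points at an encoding step, and a � usually points at a decoding step somewhere upstream.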

If this is true and you're actually talking about a Java web application, then you need to set at least the HTTP response encoding to UTF-8. You can do that by putting <%@ page pageEncoding="UTF-8" %> at the top of the JSP page in question. You may find this article useful to get more background information and a detailed overview of all steps and solutions you need to apply to solve this "Unicode problem".


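U+FFFD is not an ordinary text character; it is the Unicode replacement character. Hence, code that deliberately puts it into a string, such as (char)65533, is logically incorrect. The intended use of the replacement character is to be substituted for bad input that could not be decoded.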

How to fix it: don't put junk in strings. Strings are for text. Bytes are for random binary data.
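One way to keep junk out of strings is to decode strictly, so that invalid bytes are reported instead of being silently turned into �. The sketch below assumes the input is meant to be UTF-8; the class and method names (StrictDecodeDemo, decodeStrict) are hypothetical, and returning null is just one possible way to signal the failure.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    // Returns the decoded text, or null if the bytes are not valid UTF-8.
    static String decodeStrict(byte[] bytes) {
        // REPORT makes the decoder throw instead of substituting U+FFFD
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] valid = "ok".getBytes(StandardCharsets.UTF_8);
        byte[] invalid = {(byte) 0xE9}; // a lone ISO-8859-1 é, not valid UTF-8

        System.out.println(decodeStrict(valid));   // ok
        System.out.println(decodeStrict(invalid)); // null
    }
}
```

Failing early like this at the point where bytes enter the program makes it much easier to find which data source is using the wrong encoding.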


Well, what do you want it to do? If you're getting these characters "all over the place" I suspect you have bad data... it should be pretty rare that you receive data which can't be represented in Unicode.

How are you getting the data to start with?


Have a look at this primer on character encodings.
