开发者

What is the character encoding of String in Java?

I am actually confused regarding the encoding of strings in Java. I have a couple of questions. Please help me if you know the answer to them:

1) What is the native encoding of Java strings in memory? When I write String a = "Hello" in which format will it be stored? Since Java is machine independent I don't think the system will do the encoding.

2) I read on the net that "UTF-16" is the default encoding but I got confused because say when I write that int a = 'c' I get the number 开发者_开发问答of the character in the ASCII table. So are ASCII and UTF-16 the same?

3) Also I wasn't sure on what the storage of a string in the memory depends: OS, language?


  1. Java stores strings as UTF-16 internally.

  2. "default encoding" isn't quite right. Java stores strings as UTF-16 internally, but the encoding used externally, the "system default encoding", varies from platform to platform, and can even be altered by things like environment variables on some platforms.

    ASCII is a subset of Latin 1 which is a subset of Unicode. UTF-16 is a way of encoding Unicode. So if you perform your int i = 'x' test for any character that falls in the ASCII range you'll get the ASCII value. UTF-16 can represent a lot more characters than ASCII, however.

  3. From the java.lang.Character docs:

    The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes.

    So it's defined as part of the Java 2 platform that UTF-16 is used for these classes.


1) Strings are objects, which typically contain a char array and the strings's length. The character array is usually implemented as a contiguous array of 16-bit words, each one containing a Unicode character in native byte order.

2) Assigning a character value to an integer converts the 16-bit Unicode character code into its integer equivalent. Thus 'c', which is U+0063, becomes 0x0063, or 99.

3) Since each String is an object, it contains other information than its class members (e.g., class descriptor word, lock/semaphore word, etc.).

ADENDUM
The object contents depend on the JVM implementation (which determines the inherent overhead associated with each object), and how the class is actually coded (i.e., some libraries may be more efficient than others).

EXAMPLE
A typical implementation will allocate an overhead of two words per object instance (for the class descriptor/pointer, and a semaphore/lock control word); a String object also contains an int length and a char[] array reference. The actual character contents of the string are stored in a second object, the char[] array, which in turn is allocated two words, plus an array length word, plus as many 16-bit char elements as needed for the string (plus any extra chars that were left hanging around when the string was created).

ADDENDUM 2
The case that one char represents one Unicode character is only true in most of the cases. This would imply UCS-2 encoding and true before 2005. But by now Unicode has become larger and Strings have to be encoded using UTF-16 -- where alas a single Unicode character may use two chars in a Java String.

Take a look at the actual source code for Apache's implementation, e.g. at:
http://www.docjar.com/html/api/java/lang/String.java.html


While this doesn't answer your question, it is worth noting that... In the java byte code (class file), the string is stored in UTF-8. http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc.html


Edit : thanks to LoadMaster for helping me correcting my answer :)

1) All internal String processing is made in UTF-16.

2) ASCII is a subset of UTF-16.

3) Internally in Java is UTF-16. For the rest, it depends on where you are, yes.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜