
Java stream misconceptions... some clarification?

I understand that byte streams deal with bytes and character streams deal with characters... if I use a byte stream to read in characters, could this limit me to the sorts of characters I might read? For instance, bytes are read in as 8 bit bytes, characters are read in as 16 bit characters... does this mean that more characters can be represented using character streams rather than byte streams?

The last thing I'm confused about is how a byte stream writes out to a file for reading. If I was receiving bytes from a network socket, I would wrap them in an InputStreamReader for writing, this way I would get the character transformation logic the character stream provides. If I read from a file using a FileInputStream and write out using a FileOutputStream, why is this file readable when I open it with a text editor? How is the FileOutputStream treating the bytes?


The key concept here is character encoding: each human readable character is somehow encoded into one or more bytes. There are plenty of character encodings. The most popular ones are:

  • ASCII: a 7-bit encoding (the remaining bit is unused) that stores each character in one byte
  • UTF-8: the most common characters (the ASCII range) take a single byte, less common ones take 2 or more (up to 4)

These encodings are readable even when you open a file in a hex editor. However, there are character encodings that do not have this property, such as UTF-16 and UTF-32.
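To make the size differences concrete, here is a small sketch that encodes three sample characters and counts the resulting bytes. The class and method names (EncodingSizes, encodedLength) are made up for illustration. With UTF-8 the three samples take 1, 2 and 3 bytes; with UTF-16BE each takes 2; with US-ASCII the two non-ASCII samples cannot be represented and are replaced by a single '?' byte.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {

    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // 'A' (U+0041), e-acute (U+00E9) and the euro sign (U+20AC),
        // written as escapes so the source file encoding does not matter.
        String[] samples = {"A", "\u00e9", "\u20ac"};
        Charset[] charsets = {
            StandardCharsets.US_ASCII, StandardCharsets.UTF_8, StandardCharsets.UTF_16BE
        };
        for (String s : samples) {
            for (Charset cs : charsets) {
                System.out.printf("U+%04X in %s: %d byte(s)%n",
                        (int) s.charAt(0), cs.name(), encodedLength(s, cs));
            }
        }
    }
}
```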

Now back to your question: an InputStream only gives you a stream of bytes. If your bytes represent characters encoded with ASCII or UTF-8, most of the time you are fine. But if these bytes represent something more sophisticated like UTF-16, you absolutely need a Reader. Of course, the Reader has to know which character encoding the underlying InputStream uses. A common beginner mistake is to create a Reader without specifying the character encoding explicitly, in which case it will often fall back to the system default.
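A minimal sketch of reading through a Reader with an explicit encoding might look like this. An in-memory ByteArrayInputStream stands in for the socket or file, and the names ExplicitCharsetRead and decode are hypothetical. Decoding the same UTF-16 bytes with the wrong charset still "works", but yields 8 garbage characters instead of the 4 that were written.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetRead {

    // Reads the whole stream as text, decoding bytes with the given charset.
    static String decode(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute, encoded as UTF-16 (big endian): 8 bytes.
        byte[] utf16 = "caf\u00e9".getBytes(StandardCharsets.UTF_16BE);

        String right = decode(new ByteArrayInputStream(utf16), StandardCharsets.UTF_16BE);
        String wrong = decode(new ByteArrayInputStream(utf16), StandardCharsets.ISO_8859_1);

        System.out.println("correct charset, chars read: " + right.length());
        System.out.println("wrong charset, chars read:   " + wrong.length());
    }
}
```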

The other direction (with writers) is similar. If you simply cast your chars to bytes, most of the time you will be fine. But if your text contains less common national letters, your output will be malformed or truncated. So you create a Writer that converts each given character to a series of one or more bytes. Once again, you should specify the character encoding.
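The difference between casting and using a Writer can be sketched as follows (class and method names are invented for the example). The Polish letter U+0142 is used as the "national letter": the cast keeps only the low 8 bits, 0x42, which is the letter 'B', while the Writer produces a proper two-byte UTF-8 sequence.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class CastVersusWriter {

    // Naive approach: keep only the low 8 bits of every char.
    static byte[] castChars(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            out[i] = (byte) text.charAt(i);
        }
        return out;
    }

    // Proper approach: let a Writer encode the chars as UTF-8.
    static byte[] encodeWithWriter(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and therefore flushed) at this point.
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "z\u0142oty"; // five chars, the second one is U+0142

        byte[] cast = castChars(text);
        byte[] encoded = encodeWithWriter(text);

        // Decoding the cast bytes shows the damaged text.
        System.out.println(new String(cast, StandardCharsets.ISO_8859_1));
        System.out.println("cast: " + cast.length + " bytes, writer: " + encoded.length + " bytes");
    }
}
```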

Important rules:

  • always use InputStream when dealing with binary data (multimedia, ZIP and PDF files, etc.)
  • always use Reader when reading text (txt, HTML, XML...)
  • always know and specify the character encoding when reading characters from a byte stream, and always consciously choose the character encoding you use to write the data.


A char is a 16 bit string that represents a Unicode character (strictly speaking, one UTF-16 code unit).

A byte is an 8 bit string that represents a 2's complement number.

The important thing here is that they are both bit strings. Technically speaking, a char is simply 2 bytes. Nothing more, nothing less, aside from some minor semantics in how Java treats the two. As far as the computer (or an Input/OutputStream) is concerned, the only difference is the number of bits they hold.


I think you need to grasp the relation between a byte and a character in order to get your clarification.

The accepted answer to this question is quite clear IMHO: Why does a byte in Java I/O can represent a character?

I'd also check out byte stream and character stream

And if you don't want Joel to catch you and make you peel onions for 6 months in a submarine, just read http://www.joelonsoftware.com/articles/Unicode.html


All IO streams in Java are just byte streams underneath. Byte to character (and vice versa) conversions are done using an encoding. But underneath it all, they are all bytes.


To answer your questions:

I understand that byte streams deal with bytes and character streams deal with characters... if I use a byte stream to read in characters, could this limit me to the sorts of characters I might read?

Characters are not bytes. A character is stored in one or more bytes according to the selected encoding scheme. It is the encoding scheme, not the kind of stream, that limits or extends the set of characters you can read.

For instance, bytes are read in as 8 bit bytes, characters are read in as 16 bit characters... does this mean that more characters can be represented using character streams rather than byte streams?

In a way, yes.

The last thing I'm confused about is how a byte stream writes out to a file for reading. If I was receiving bytes from a network socket, I would wrap them in an InputStreamReader for writing, this way I would get the character transformation logic the character stream provides. If I read from a file using a FileInputStream and write out using a FileOutputStream, why is this file readable when I open it with a text editor? How is the FileOutputStream treating the bytes?

For bytes/data corresponding to characters, you should use an OutputStreamWriter for writing to a file so that it is readable with a text editor. You can specify the encoding at creation, and the stream will perform the encoding of your text data.
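As a rough illustration of both halves of the question, the sketch below writes text through an OutputStreamWriter with an explicit encoding and then copies the file with plain FileInputStream/FileOutputStream. The byte streams do not interpret anything; the copy is byte-for-byte identical to the source, which is why a text editor can still read it. The class name, method names and temp-file prefixes are assumptions made for the example.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class ByteCopyDemo {

    // Writes text to a file, encoding the characters as UTF-8.
    static void writeText(Path target, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(target.toFile()), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    // Copies a file with byte streams; the bytes pass through untouched.
    static void copyBytes(Path source, Path target) throws IOException {
        try (InputStream in = new FileInputStream(source.toFile());
             OutputStream out = new FileOutputStream(target.toFile())) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    // True if the copy has identical bytes and decodes back to the same text.
    static boolean roundTrip(String text) {
        try {
            Path original = Files.createTempFile("original", ".txt");
            Path copy = Files.createTempFile("copy", ".txt");
            try {
                writeText(original, text);
                copyBytes(original, copy);
                byte[] copied = Files.readAllBytes(copy);
                return Arrays.equals(Files.readAllBytes(original), copied)
                        && new String(copied, StandardCharsets.UTF_8).equals(text);
            } finally {
                Files.deleteIfExists(original);
                Files.deleteIfExists(copy);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("copy matches original: " + roundTrip("caf\u00e9 costs 3 \u20ac"));
    }
}
```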
