Java stream misconceptions... some clarification?
I understand that byte streams deal with bytes and character streams deal with characters... if I use a byte stream to read in characters, could this limit me to the sorts of characters I might read? For instance, bytes are read in as 8 bit bytes, characters are read in as 16 bit characters... does this mean that more characters can be represented using character streams rather than byte streams?
The last thing I'm confused about is how a byte stream writes out to a file for reading. If I were receiving bytes from a network socket, I would wrap them in an InputStreamReader for writing; this way I would get the character transformation logic the character stream provides. If I read from a file using a FileInputStream and write out using a FileOutputStream, why is this file readable when I open it with a text editor? How is the FileOutputStream treating the bytes?
The key concept here is character encoding: each human readable character is somehow encoded into one or more bytes. There are plenty of character encodings. The most popular ones are:
- ASCII (7 bits; the eighth bit of each byte is unused), which stores one character per byte
- UTF-8: ASCII characters are represented as a single byte, all other characters as two to four bytes
Text in these encodings is still readable when you open a file in a hex editor. However, there are many character encodings that do not have this property, notably UTF-16 and UTF-32.
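To make the difference between encodings concrete, here is a minimal sketch that counts how many bytes a few characters occupy in UTF-8 and UTF-16. The class and method names (EncodingSizes, utf8Length, utf16Length) are made up for illustration; only String.getBytes and StandardCharsets come from the JDK.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies when encoded as big-endian UTF-16 (no byte order mark).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of the compiler's source encoding.
        String[] samples = {"A", "\u00e9", "\u20ac"}; // A, e with acute accent, euro sign
        for (String s : samples) {
            System.out.println(s + " -> UTF-8: " + utf8Length(s)
                    + " byte(s), UTF-16BE: " + utf16Length(s) + " byte(s)");
        }
    }
}
```

Plain ASCII text encoded as UTF-8 is byte-for-byte identical to ASCII, which is why it stays legible in a hex editor; in UTF-16 every such character carries an extra zero byte.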
Now back to your question: InputStream only gives you a stream of bytes. If your bytes represent characters encoded with ASCII or UTF-8, most of the time you are fine. But if these bytes represent something more sophisticated like UTF-16, you absolutely need a Reader. Of course, the reader has to know which character encoding the underlying InputStream provides. This is a common beginner mistake: a Reader that is not given a character encoding explicitly will often fall back to the system default.
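As a rough illustration of why the Reader needs to be told the encoding, the sketch below decodes the same UTF-16 bytes twice, once with the right charset and once with a wrong one. ExplicitCharsetRead and decode are hypothetical names, and a ByteArrayInputStream stands in for a file or socket stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetRead {
    // Decode the bytes as text, telling the Reader explicitly which charset to use.
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] utf16 = "h\u00e9llo".getBytes(StandardCharsets.UTF_16BE);

        // Correct charset: the original five characters come back.
        System.out.println(decode(utf16, StandardCharsets.UTF_16BE).length()); // 5

        // Wrong charset: every byte is treated as its own character.
        System.out.println(decode(utf16, StandardCharsets.ISO_8859_1).length()); // 10
    }
}
```

Passing the charset as a parameter, rather than relying on the one-argument InputStreamReader constructor, keeps the result independent of the machine the code runs on.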
The other direction (with writers) is similar. If you simply cast your chars to bytes, most of the time you will be fine. But if your characters include less common national letters, your output will be malformed or truncated. So you create a Writer that converts each given character to a series of one or more bytes. Once again, you have to provide the character encoding.
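The truncation problem can be shown with a small sketch that compares a naive cast with a Writer given an explicit charset. The names CastVersusWriter, castToBytes and encodeWithWriter are invented for this example; the sample character is the Polish letter U+0142.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CastVersusWriter {
    // Naive approach: keep only the low 8 bits of each char.
    static byte[] castToBytes(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            out[i] = (byte) text.charAt(i);
        }
        return out;
    }

    // Proper approach: let a Writer encode the characters in the given charset.
    static byte[] encodeWithWriter(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String letter = "\u0142"; // Polish "l with stroke", code point 0x0142

        // The cast drops the high byte 0x01 and leaves 0x42, which is the letter 'B'.
        System.out.println((char) castToBytes(letter)[0]); // B

        // The Writer produces the two-byte UTF-8 sequence for the letter.
        System.out.println(encodeWithWriter(letter, StandardCharsets.UTF_8).length); // 2
    }
}
```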
Important rules:
- always use InputStream when dealing with binary data (multimedia, ZIP and PDF files, etc.)
- always use Reader when reading text (txt, HTML, XML...)
- always know and specify the character encoding when reading characters from a byte stream, and always consciously choose the character encoding you use to write the data.
A char is a 16 bit string that represents a UTF-16 code unit. Most Unicode characters fit in a single char; characters outside the Basic Multilingual Plane need two.
A byte is an 8 bit string that represents a 2's complement number.
The important thing here is that they are both bit strings. Technically speaking, a char is simply 2 bytes. Nothing more, nothing less, aside from some minor semantics in how Java treats the two. As far as the computer (or the Input/OutputStreams) is concerned, the only difference is the number of bits they hold.
I think you need to grasp the relation between a byte and a character in order to get your clarification.
The accepted answer to this question is quite clear IMHO : Why does a byte in Java I/O can represent a character?
I'd also check out byte stream and character stream
And if you don't want Joel to catch you and make you peel onions for 6 months in a submarine, just read http://www.joelonsoftware.com/articles/Unicode.html
All I/O streams in Java are just byte streams underneath. Byte-to-character (and vice versa) conversions are done using an encoding. But underneath it all, they are all bytes.
To answer your questions:
I understand that byte streams deal with bytes and character streams deal with characters... if I use a byte stream to read in characters, could this limit me to the sorts of characters I might read?
Characters are not bytes. A character is stored in one or more bytes according to the selected encoding scheme. It is the encoding scheme, not the kind of stream, that determines which characters you can read.
For instance, bytes are read in as 8 bit bytes, characters are read in as 16 bit characters... does this mean that more characters can be represented using character streams rather than byte streams?
In a way, yes.
The last thing im confused about is how a byte stream writes out to a file for reading. If I was recieving bytes from a network socket, I would wrap them in a InputStreamReader for writing, this way I would get the character transformation logic the character stream provides. If I read from a file using a FileInputStream and write out using a FileOutputStream, why is this file readable when I open it with a text editor? How is the FileOutputStream treating the bytes?
For bytes/data corresponding to characters, you should use an OutputStreamWriter to write to a file so that it is readable in a text editor. You can specify the encoding when you create it, and the stream will perform the encoding of your text data.
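Here is a hedged sketch covering both cases from the question: text written through an OutputStreamWriter with an explicit encoding, and a plain FileInputStream to FileOutputStream copy. The class and method names (CopyKeepsEncoding, writeText, copyBytes, roundTrip) and the temporary files are assumptions made for the example. The byte streams never interpret the data, so the copy stays readable simply because its bytes are identical to the original's.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CopyKeepsEncoding {
    // Encode the text as UTF-8 while writing it to the file.
    static void writeText(Path file, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    // Copy raw bytes; no decoding or encoding happens here.
    static void copyBytes(Path from, Path to) throws IOException {
        try (InputStream in = new FileInputStream(from.toFile());
             OutputStream out = new FileOutputStream(to.toFile())) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }

    // Write the text, copy the file byte by byte, then decode the copy as UTF-8.
    static String roundTrip(String text) {
        try {
            Path source = Files.createTempFile("source", ".txt");
            Path copy = Files.createTempFile("copy", ".txt");
            try {
                writeText(source, text);
                copyBytes(source, copy);
                return new String(Files.readAllBytes(copy), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(source);
                Files.deleteIfExists(copy);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "za\u017c\u00f3\u0142\u0107"; // Polish sample with non-ASCII letters
        System.out.println(roundTrip(text).equals(text)); // true
    }
}
```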