How exactly are data types represented in a computer?
I'm a beginning programmer reading K&R, and I feel as if the book assumes a lot of previous knowledge. One aspect that confuses me is the actual representation, or should I say existence, of variables in memory. What exactly does a data type specify for a variable? I'm not too sure of how to word this question... but I'll ask a few questions and perhaps someone can come up with a coherent answer for me.
When using getchar(), I was told that it is better to use type "int" than type "char" due to the fact that "int" can hold more values while "char" can hold only 256 values. Since we may need the variable to hold the EOF value, we will need more than 256 or the EOF value will overlap with one of the 256 characters. In my mind, I view this as a bunch of boxes with empty holes. Could someone give me a better representation? Do these "boxes" have index numbers? When EOF overlaps with a value in the 256 available values, can we predict which value it will overlap with?
Also, does this mean that the data type "char" is only fine to use when we are simply assigning a value to a variable manually, such as char c = 'a', when we definitely know that we will only have 256 possible ASCII characters?
Also, what is the actual important difference between "char" and "int"? If we can use "int" type instead of "char" type, why do we decide to use one over the other at certain times? Is it to save "memory" (I use quotes as I do not actually know how "memory" exactly works)?
Lastly, how exactly are the 256 available values of type char obtained? I read something about modulo 2^n, where n = 8, but why does that work (something to do with binary?)? What does the "modulo" portion of "modulo 2^n" mean (if it has any relevance to modular arithmetic, I can't see the relation...)?
Great questions. K&R was written back in the days when there was a lot less to know about computers, and so programmers knew a lot more about the hardware. Every programmer ought to be familiar with this stuff, but (understandably) many beginning programmers aren't.
At Carnegie Mellon University they developed an entire course to fill in this gap in knowledge, which I was a TA for. I recommend the textbook for that class: "Computer Systems: A Programmer's Perspective" http://amzn.com/013034074X/
The answers to your questions are longer than can really be covered here, but I'll give you some brief pointers for your own research.
Basically, computers store all information--whether in memory (RAM) or on disk--in binary, a base-2 number system (as opposed to decimal, which is base 10). One binary digit is called a bit. Computers tend to work with memory in 8-bit chunks called bytes.
A char in C is one byte. An int is typically four bytes (although it can be different on different machines). So a char can hold only 256 possible values, 2^8. An int can hold 2^32 different values.
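If you want to check this on your own machine, here is a minimal sketch (assuming a hosted C environment; the size printed for int can differ from the typical 4 bytes):

#include <stdio.h>

int main(void)
{
    /* sizeof reports the size of a type in bytes;
       sizeof(char) is 1 by definition, the others vary by platform */
    printf("char: %zu byte(s)\n", sizeof(char));
    printf("int:  %zu byte(s)\n", sizeof(int));
    return 0;
}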
For more, definitely read the book, or read a few Wikipedia pages:
- http://en.wikipedia.org/wiki/Binary_numeral_system
- http://en.wikipedia.org/wiki/Twos_complement
Best of luck!
UPDATE with info on modular arithmetic as requested:
First, read up on modular arithmetic: http://en.wikipedia.org/wiki/Modular_arithmetic
Basically, in a two's complement system, an n-bit number really represents an equivalence class of integers modulo 2^n.
If that seems to make it more complicated instead of less, then the key things to know are simply:
- An unsigned n-bit number holds values from 0 to 2^n-1. The values "wrap around", so e.g., when you add two numbers and get 2^n, you really get zero. (This is called "overflow".)
- A signed n-bit number holds values from -2^(n-1) to 2^(n-1)-1. Numbers still wrap around, but the highest number wraps around to the most negative, and it starts counting up towards zero from there.
So, an unsigned byte (8-bit number) can be 0 to 255. 255 + 1 wraps around to 0. 255 + 2 ends up as 1, and so forth. A signed byte can be -128 to 127. 127 + 1 ends up as -128. (!) 127 + 2 ends up as -127, etc.
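Here is a small sketch of that wrap-around behaviour (the signed result is what you typically see on a two's complement machine; the conversion back to signed char is implementation-defined):

#include <stdio.h>

int main(void)
{
    unsigned char u = 255;
    u = u + 1;            /* wraps around to 0: well-defined for unsigned types */
    printf("%d\n", u);    /* prints 0 */

    signed char s = 127;
    s = s + 1;            /* the out-of-range result is converted back to signed char;
                             on a typical two's complement machine this yields -128 */
    printf("%d\n", s);    /* usually prints -128 */
    return 0;
}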
One aspect that confuses me is the actual representation, or should I say existence, of variables in memory. What exactly does a data type specify for a variable?
At the machine level, the difference between int and char is only the size, or number of bytes, of the memory allocated for it by the programming language. In C, a char is one byte while an int is typically 4 bytes. If you were to "look" at these inside the machine itself, you would see a sequence of bits for each. Being able to treat them as int or char depends on how the language decides to interpret them (this is also why it's possible to convert back and forth between the two types).
When using getchar(), I was told that it is better to use type "int" than type "char" due to the fact that "int" can hold more values while "char" can hold only 256 values.
This is because there are 2^8, or 256, combinations of 8 bits (because a bit can have two possible values), whereas there are 2^32 combinations of 32 bits. The EOF constant (as defined by C) is a negative value, not falling within the range 0 to 255. If you try to assign this negative value to a char (thus squeezing its 4 bytes into 1), the higher-order bits will be lost and you will end up with a valid char value that is NOT the same as EOF. This is why you need to store it into an int and check before casting to a char.
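The usual pattern looks like this (a minimal sketch of the copy-input-to-output loop from K&R):

#include <stdio.h>

int main(void)
{
    int c;    /* int, not char, so EOF can be distinguished from every real character */

    while ((c = getchar()) != EOF)
        putchar(c);
    return 0;
}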
Also, does this mean that the data type "char" is only fine to use when we are simply assigning a value to a variable manually, such as char c = 'a', when we definitely know that we will only have 256 possible ASCII characters?
Yes, especially since in that case you are assigning a character literal.
Also, what is the actual important difference between "char" and "int"? If we can use "int" type instead of "char" type, why do we decide to use one over the other at certain times?
Most importantly, you would pick int or char at the language level depending on whether you wanted to treat the variable as a number or a letter (to switch, you would need to cast to the other type). If you wanted an integer value that took up less space, you could use a short int (which I believe is 2 bytes), or if you were REALLY concerned with memory usage you could use a char, though mostly this is not necessary.
Edit: here's a link describing the different data types in C and modifiers that can be applied to them. See the table at the end for sizes and value ranges.
Basically, system memory is one huge series of bits, each of which can be either "on" or "off". The rest is conventions and interpretation.
First of all, there is no way to access individual bits directly; instead they are grouped into bytes, usually in groups of 8 (there are a few exotic systems where this is not the case, but you can ignore that for now), and each byte gets a memory address. So the first byte in memory has address 0, the second has address 1, etc.
A byte of 8 bits has 2^8 possible different values, which can be interpreted as a number between 0 and 255 (unsigned byte), or as a number between -128 and +127 (signed byte), or as an ASCII character. A variable of type char per the C standard has a size of 1 byte.
But bytes are too small for a lot of things, so other types have been defined that are larger (i.e. they consist of multiple bytes), and CPUs support these different types through special hardware constructs. An int is typically 4 bytes nowadays (though the C standard does not specify this, and ints can be smaller or bigger on different systems) because 4 bytes are 32 bits, and until recently that was what mainstream CPUs supported as their "word size".
So a variable of type int is 4 bytes large. That means when its memory address is e.g. 1000, it actually covers the bytes at addresses 1000, 1001, 1002, and 1003. In C, it is also possible to address those individual bytes directly, and that is how variables can overlap.
As a sidenote, most systems require larger types to be "word-aligned", i.e. their addresses have to be multiples of the word size, because that makes things easier for the hardware. So it is not possible to have an int variable start at address 999, or address 17 (but 1000 and 16 are OK).
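You can see both the sizes and the alignment padding on your own machine with a small sketch like this (the exact addresses and padding are implementation-dependent; the struct is just a convenient way to make the layout visible):

#include <stdio.h>

int main(void)
{
    /* A struct makes the padding easy to see; the compiler is free
       to place separate local variables wherever it likes. */
    struct example {
        char c;   /* 1 byte                          */
        int  i;   /* typically 4 bytes, word-aligned */
    };

    struct example e;
    printf("address of e.c: %p\n", (void *)&e.c);
    printf("address of e.i: %p\n", (void *)&e.i);  /* usually 4 bytes later, not 1 */
    printf("sizeof(struct example): %zu\n", sizeof(struct example));  /* often 8, not 5 */
    return 0;
}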
I'm not going to completely answer your question, but I would like to help you understand variables, as I had the same problems understanding them when I began to program by myself.
For the moment, don't bother with the electronic representation of variables in memory. Think of memory as a continuous block of 1-byte cells, each storing a bit pattern (consisting of 0s and 1s).
By solely looking at the memory, you can't determine what the bits in it represent! They are just arbitrary sequences of 0s and 1s. It is YOU who specifies HOW to interpret those bit patterns! Take a look at this example:
int a, b, c;
...
c = a + b;
You could have written the following as well:
float a, b, c;
...
c = a + b;
In both cases, the variables a, b and c are stored somewhere in memory (and you can't tell their type by looking at it). Now, when the compiler compiles your code (that is, translates your program into machine instructions), it makes sure to translate the "+" into integer_add in the first case and float_add in the second case, so the CPU will interpret the bit patterns correctly and perform what you desired.
Variable types are like glasses that let the CPU look at bit patterns from different perspectives.
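To make the "glasses" idea concrete, here is a small sketch (assuming 4-byte int and float, which is typical but not guaranteed) that looks at the same bit pattern through both types:

#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 1.0f;
    unsigned int bits;

    /* Copy the raw bytes of the float into an unsigned int,
       so the same bit pattern can be viewed both ways. */
    memcpy(&bits, &f, sizeof bits);

    printf("as float: %f\n", f);         /* 1.000000 */
    printf("as bits:  0x%08x\n", bits);  /* 0x3f800000 on IEEE 754 machines */
    return 0;
}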
G'day,
To go deeper, I'd highly recommend Charles Petzold's excellent book "Code".
It covers more than what you ask, all of which leads to a better understanding of what's actually happening under the covers.
HTH
Really, datatypes are an abstraction that allows your programming language to treat a few bytes at some address as some kind of numeric type. Consider the data type as a lens that lets you see a piece of memory as an int, or a float. In reality, it's all just bits to the computer.
- In C, EOF is a "small negative number".
- In C, the char type may be unsigned, meaning that it cannot represent negative values.
- For unsigned types, when you try to assign a negative value to them, it is converted to an unsigned value. If MAX is the maximum value an unsigned type can hold, then assigning -n to such a type is equivalent to assigning (MAX + 1) - (n % (MAX + 1)) to it. So, to answer your specific question about predicting: yes, you can. For example, let's say char is unsigned and can hold values 0 to 255 inclusive. Then assigning -1 to a char is equivalent to assigning 256 - 1 = 255 to it.
Given the above, to be able to store EOF in c, c can't be of char type. Thus, we use int, because it can store "small negative values". In particular, in C, int is guaranteed to store values in the range -32767 to +32767. That is why getchar() returns int.
Also, does this mean that the data type "char" is only fine to use when we are simply assigning a value to a variable manually, such as char c = 'a', when we definitely know that we will only have 256 possible ASCII characters?
If you are assigning values directly, then the C standard guarantees that expressions like 'a' will fit in a char. Note that in C, 'a' is of type int, not char, but it's okay to do char c = 'a', because 'a' is able to fit in a char type.
About your question as to what type a variable should hold, the answer is: use whatever type makes sense. For example, if you're counting, or looking at string lengths, the numbers can only be greater than or equal to zero. In such cases, you should use an unsigned type. size_t is such a type.
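For example, a length that can never be negative is a natural fit for size_t (a small sketch):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = "hello";
    size_t len = strlen(s);   /* strlen returns size_t: a length cannot be negative */

    for (size_t i = 0; i < len; i++)
        putchar(s[i]);
    putchar('\n');
    return 0;
}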
Note that it is sometimes hard to figure out the type of data, and even the "pros" may make mistakes. The gzip format, for example, stores the size of the uncompressed data in the last 4 bytes of a file. This breaks for huge files > 4 GB in size, which are fairly common these days.
You should be careful about your terminology. In C, char c = 'a' assigns an integer value corresponding to 'a' to c, but it need not be ASCII. It depends upon whatever encoding you happen to use.
About the "modulo" portion, and 256 values of type char
: if you have n
binary bits in a data type, each bit can encode 2 values: 0 and 1. So, you have 2*2*2...*2
(n
times) available values, or 2n. For unsigned types, any overflow is well-defined, it is as if you divided the number by (the maximum possible value+1), and took the remainder. For example, let's say unsigned char
can store values 0..255
(256 total values). Then, assigning 257
to an unsigned char
will basically divide it by 256, take the remainder (1), and assign that value to the variable. This relation holds true for unsigned types only though. See my answer to another question for more.
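As a tiny sketch of that modulo behaviour (assuming unsigned char is 8 bits, which is the common case; the compiler may warn about the out-of-range constant, but the conversion itself is well-defined):

#include <stdio.h>

int main(void)
{
    unsigned char a = 257;    /* 257 mod 256 = 1  */
    unsigned char b = -1;     /* -1 wraps to 255  */

    printf("%d %d\n", a, b);  /* prints: 1 255    */
    return 0;
}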
Finally, you can use char arrays to read data from a file in C, even though you might end up hitting EOF, because C provides other ways of detecting EOF without having to read it into a variable explicitly. You will learn about this later when you have read about arrays and pointers (see fgets() if you're curious for one example).
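A minimal sketch of that approach (fgets signals end of file through its return value, so EOF never has to be stored in the char array itself):

#include <stdio.h>

int main(void)
{
    char line[256];

    /* fgets returns NULL at end of file (or on error) */
    while (fgets(line, sizeof line, stdin) != NULL)
        fputs(line, stdout);
    return 0;
}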
According to "stdio.h" getchars() return value is int and EOF is defined as -1. Depending on the actual encoding all values between 0..255 can occur, there for unsigned char is not enough to represent the -1 and int is used. Here is a nice table with detailed information http://en.wikipedia.org/wiki/ISO/IEC_8859
The beauty of K&R is its conciseness and readability; writers always have to make concessions for their goals, so rather than being a 2000-page reference manual it serves as a basic reference and an excellent way to learn the language in general. I recommend Harbison and Steele's "C: A Reference Manual" as an excellent C reference book for details, and the C standard of course.
You need to be willing to google this stuff. Variables are represented in memory at specific locations and are known to the program of which they are a part within a given scope. A char will typically be stored in 8 bits of memory (on some rare platforms this isn't necessarily true). 2^8 gives 256 distinct possibilities for its value. Different CPUs/compilers/etc. represent the basic types int and long with varying sizes. I think the C standard might specify minimum sizes for these, but not maximum sizes. I think for double it specifies at least 64 bits, but this doesn't preclude Intel from using 80 bits in a floating point unit. Anyway, typical sizes in memory on 32-bit Intel platforms would be 32 bits (4 bytes) for unsigned/signed int and float, 64 bits (8 bytes) for double, and 8 bits for char (signed/unsigned).
You should also look up memory alignment if you are really interested in the topic. You can also look at the exact layout in your debugger by getting the address of your variable with the "&" operator and then peeking at that address. Intel platforms may confuse you a little when looking at values in memory, so please look up little endian/big endian as well. I am sure Stack Overflow has some good summaries of this as well.
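A quick way to see the byte order on your own machine is to dump the bytes of an int (a sketch; the comments assume a little-endian machine with 4-byte ints):

#include <stdio.h>

int main(void)
{
    int value = 0x01020304;
    unsigned char *p = (unsigned char *)&value;  /* view the int one byte at a time */

    for (size_t i = 0; i < sizeof value; i++)
        printf("byte %zu at %p: 0x%02x\n", i, (void *)&p[i], p[i]);
    /* On a little-endian machine the bytes print as 04 03 02 01. */
    return 0;
}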
All of the characters needed in a language are represented by ASCII and Extended ASCII. So there is no character beyond Extended ASCII.
While using char, there is a chance of getting a garbage value, as it directly stores the character; but using int, there is less chance of that, as it stores the ASCII value of the character.