Dynamically storing information from a file using C
I'm new to C and trying to learn a few things. What I'm trying to do is read in a file and store the information. Since the format will be a CSV, the plan is to read in each character, determine if it's a number or a comma, and store the numbers in a linked list. The problem I'm having is reading in numbers that are more than one character long, like the following example.
5,2,24,5
Here's the code I've got so far; it's just not giving the output that I expect. The output is below the code sample.
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
struct list {
float value;
struct list * next;
struct list * prev;
};
int main( int argc, char *argv[] ){
FILE *infile;
char *token = NULL;
char my_char;
/* Open the file. */
// The file name should be in argv[1]
if((infile = fopen(argv[1], "r")) == NULL) {
printf("Error Opening File.\n");
printf("ERROR: %s\n", strerror(errno));
exit(1);
}
while((my_char = (char)fgetc(infile)) != EOF){
//Is my_char a number?
if(isdigit(my_char)){
if(token == NULL){
token = (char *)malloc(sizeof(char));
memset(token, '\0', 1);
strcpy(token, &my_char);
printf("length of token -> %d\n", strlen(token));
printf("%c\n", *token);
} else {
token = (char *)realloc(token, sizeof(token) + 1);
strcat(token, &my_char);
printf("%s\n", token);
}
}
}
free(token);
fclose(infile);
}
And here is the output:
[estest@THEcomputer KernelFunctions]$ nvcc linear_kernel.cu -o linear_kernel.exe
[estest@THEcomputer KernelFunctions]$ ./linear_kernel.exe iris.csv
length of token -> 5
5
5a#1a#
5a#1a#3a#
5a#1a#3a#5a#
5a#1a#3a#5a#1a#
5a#1a#3a#5a#1a#4a#
*** glibc detected *** ./linear_kernel.exe: realloc(): invalid next size: 0x0000000001236350 ***
I don't understand why the length of the token is '5' when I expect it to be 1, or where the strange-looking characters that follow 5 (represented by 'a#') come from. Can anyone help me understand this a little better?
char *token = NULL;
token = (char *)realloc(token, sizeof(token) + 1);
token
is a pointer. sizeof
doesn't give you the allocated size of the chunk of memory to which it points; it gives you the size of the pointer object itself. Pointers are typically 4 or 8 bytes, so you're always reallocating to the same 5 or 9 bytes, no matter how long the string has grown.
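Since sizeof can't recover the allocation size from a pointer, the usual fix is to store the size next to the pointer. Here is a minimal sketch of that idea; struct buffer and buffer_reserve are names I made up for illustration, not anything from the question or the standard library.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper type: a buffer that remembers its own capacity,
   because sizeof cannot recover it from the pointer. */
struct buffer {
    char  *data;
    size_t capacity;   /* bytes actually allocated */
};

/* Grow the buffer to at least `wanted` bytes.
   Returns 0 on success, -1 on allocation failure. */
int buffer_reserve(struct buffer *b, size_t wanted)
{
    if (wanted <= b->capacity)
        return 0;                        /* already big enough */
    char *p = realloc(b->data, wanted);  /* realloc(NULL, n) acts like malloc(n) */
    if (p == NULL)
        return -1;                       /* b->data is still valid here */
    b->data = p;
    b->capacity = wanted;
    return 0;
}
```

Start from `struct buffer b = { NULL, 0 };` and call buffer_reserve(&b, needed) before writing; sizeof b.data stays the size of a pointer throughout, while b.capacity tracks the real allocation.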
Some more suggestions:
exit(1);
exit(EXIT_FAILURE)
is more portable.
char my_char;
while((my_char = (char)fgetc(infile)) != EOF){
fgetc
returns an int, not a char. The value is either the next character read from the file (represented as an unsigned char and then converted to int, so typically in the range 0..255) or the value EOF
(which is typically -1). If plain char is signed on your system, an input character that happens to be 255 will cause your loop to terminate prematurely; if plain char is unsigned, your loop will typically never end, because the negative value of EOF is converted to an unsigned value (usually 255) that never compares equal to EOF. Either way, make my_char
an int.
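To illustrate the point about fgetc, here is a small sketch of a read loop that keeps the result in an int. The function count_bytes is a hypothetical name of mine; it just counts how many bytes the loop sees, so a byte with value 255 in the input doesn't cut it short.

```c
#include <stdio.h>

/* Hypothetical helper: count the bytes remaining in a stream. */
long count_bytes(FILE *fp)
{
    int c;        /* int, so EOF stays distinct from every byte value */
    long n = 0;

    while ((c = fgetc(fp)) != EOF)
        n++;
    return n;
}
```

If you need the value as a char later (for example to store it in a string), convert it inside the loop, after the comparison against EOF has already been made.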
token = (char *)malloc(sizeof(char));
Don't cast the result of malloc()
. It's not necessary (malloc()
returns a void*
so it can be converted implicitly), and it can hide errors. sizeof(char)
is 1 by definition. Just write:
token = malloc(1);
And always check the return value; malloc()
returns NULL on failure.
memset(token, '\0', 1);
Simpler: *token = '\0';
Allocating a single byte, then realloc()
ating one additional byte at a time, is likely to be terribly inefficient.
strcat(token, &my_char);
The second argument to strcat()
must be a pointer to a string. &my_char
is of the right type, but if the byte following my_char
in memory doesn't happen to be a '\0'
, Bad Things Can Happen
.
This is not an exhaustive review.
Recommended reading: the comp.lang.c FAQ.
The main issue appears to be a problem with null terminated strings. The malloc
call is allocating 1 byte. But strcpy
copies bytes until it reaches a null terminator (a zero byte). So the results are not well defined since the byte after my_char
is a "random" value from the stack.
You need to allocate one byte longer (and realloc one byte longer) than the length of the string to allow for a null terminator. And the strcpy
and strcat
calls are not valid for the source "string" which is actually just a character. To continue using the basic logic that you are implementing, it would be necessary to simply assign the character value to the appropriate position in the token
array. Alternatively, you could declare my_char
as a two byte character array and set the second byte to a 0 terminator to allow strcpy
and strcat
to be used. For example,
char my_char[2];
my_char[1] = '\0';
And then it would be necessary to change the usage of my_char
accordingly (assign the value to my_char[0]
, and remove the &
in the strcpy/strcat calls). The compiler warnings/errors would help address those changes.
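The first option (assigning the character directly and always leaving room for the terminator) could look roughly like this. It's only a sketch: append_char is a hypothetical helper, and it still grows the string one byte at a time, which is simple but not efficient.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: append one character to a heap-allocated string.
   `s` may be NULL (treated as an empty string). Returns the possibly moved
   string, or NULL on allocation failure, in which case `s` is untouched. */
char *append_char(char *s, char ch)
{
    size_t len = (s != NULL) ? strlen(s) : 0;
    char *p = realloc(s, len + 2);   /* old text + new char + '\0' */

    if (p == NULL)
        return NULL;
    p[len] = ch;                     /* plain assignment, no strcpy/strcat */
    p[len + 1] = '\0';
    return p;
}
```

Because NULL is returned on failure, assign the result to a temporary first if you still need to free the old string in that case.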
You're allocating only 1 byte of data for your string in your code:
token = (char *)malloc(sizeof(char));
memset(token, '\0', 1);
However, because you're only zeroing out one byte, your string is not necessarily null terminated. What you're most likely seeing is extra junk that was in the memory after your char *.
For one, it would be a lot easier for you to read 1 whole line at a time as opposed to 1 character at a time. You can then use strtok()
to split the line by the commas.
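As a rough sketch of the strtok() approach, assuming the line is already in a writable buffer: parse_csv_line is a name I invented, and it uses strtof() for the conversion. Keep in mind that strtok() modifies the buffer and silently skips empty fields such as the one in "1,,2".

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: split `line` on commas and convert each field to
   float. Stores at most `max` values in `out` and returns how many were
   stored. `line` is modified in place by strtok. */
size_t parse_csv_line(char *line, float *out, size_t max)
{
    size_t n = 0;

    for (char *field = strtok(line, ",\n"); field != NULL && n < max;
         field = strtok(NULL, ",\n")) {
        char *end;
        float v = strtof(field, &end);

        if (end != field)            /* skip fields that are not numbers */
            out[n++] = v;
    }
    return n;
}
```

With the example from the question, a buffer holding "5,2,24,5" yields four values, and the two-digit 24 needs no special handling.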
There are a few problems with your code:
token = (char *)malloc(sizeof(char));
This will only allocate 1 byte. C strings have to be null-terminated, so even a string of length 1 requires 2 bytes of allocated space.
strcpy(token, &my_char);
strcat(token, &my_char);
my_char
is a single character, not a null-terminated string (which is what strcpy()
and strcat()
expect).
sizeof(token)
This is not what you mean to do. This gives you the size of a pointer (which is the type of token
). You probably want something like strlen()
, but you'd have to refactor your code to make sure you're using null-terminated strings as opposed to single characters.
Your my_char
should be int
because that's what fgetc
returns. Using a char
means that you may never detect the EOF condition, or may detect it too early:
int my_char;
/*...*/
while((my_char = fgetc(infile)) != EOF) {
The EOF
value is an int
that is not a valid char
. That's how you can detect the end of a file while reading it one byte at a time. From the fine manual:
If the integer value returned by fgetc() is stored into a variable of type char and then compared against the integer constant EOF, the comparison may never succeed, because sign-extension of a variable of type char on widening to integer is implementation-defined.
Others have pointed out your memory errors so I'll leave those alone.
while((my_char = (char)fgetc(infile)) != EOF){
This is bad times. fgetc
returns int
. It can represent more values than char
. EOF
is typically -1
. Since you're storing in a char
, how do you expect to represent the character 0xff
? You won't; you'll end up treating it as EOF
. You should do this:
int c;
while ((c=fgetc(infile)) != EOF)
{
char my_char = c;
Next up...
token = (char *)malloc(sizeof(char));
You should check the return value of malloc
. You should also consider allocating more than you need up front, otherwise every call to realloc
could potentially have to copy the characters that you've seen so far. You will get better algorithmic complexity by, say, making every allocation size a power of 2. Also, unlike C++, in C you don't need to cast from void*
.
memset(token, '\0', 1);
strcpy(token, &my_char);
This is not what you think it means. (&my_char)[1]
must be zero for this to work, so this is undefined behavior. You should try this:
token[0] = my_char;
token[1] = 0;
Also, you only allocated 1 char
. You need 2 for this to work.
token = (char *)realloc(token, sizeof(token) + 1);
sizeof
does not magically remember how much you allocated last time; it yields the compile-time size of its operand's type, in this case equivalent to sizeof(char*)
which would be 4 or 8 on 32 or 64-bit systems respectively. You need to track the real allocation size in a variable. Also this kind of realloc
leaks the original block on failure, because the only pointer to it gets overwritten with NULL. You should do this:
void *ptr = realloc(token, new_length);
if (!ptr) { /* TODO: handle error */ }
token = ptr;
Moving on...
strcat(token, &my_char);
This has the same undefined behavior as the last use of &my_char
as if it was a C string. Also, even if it did work it is wasteful, since strcat
must traverse the entire string to find the end.
Summary of my suggestions follows:
int c;
size_t alloc_size = 0;
size_t current_len = 0;
char *token = NULL;
void *ptr;
while ((c = fgetc(infile)) != EOF)
{
if (isdigit(c))
{
if (alloc_size < current_len + 2)
{
if (!alloc_size)
{
// Set some arbitrary start size...
//
alloc_size = 64;
}
else
{
alloc_size *= 2;
}
if (!token)
ptr = malloc(alloc_size);
else
ptr = realloc(token, alloc_size);
if (!ptr)
{
free(token);
return -1;
}
// Only now is it safe to replace the old pointer.
token = ptr;
}
token[current_len++] = c;
token[current_len] = 0;
}
}
/* TODO: do something with token... */
free(token);
The implementation of strcpy
is as simple as
while(*dest++ = *src++);
So, memory pointed by src
is expected to end with at least one '\0' character. In your case, the single element array holds a character that's not null. Hence, strcpy
goes beyond its memory and keeps reading until it happens to find a zero byte, which is undefined behavior and can end in a fault. This is not observed when a call like strcpy(buff, "abcd")
is made, because the compiler stores the literal with its terminator, as abcd\0
, in the program's static data (typically a read-only section).
To solve your problem in general, using fgets
and strtok
will be a better and easier way of solving it.
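Putting that suggestion together with the linked list from the question, a sketch might look like the following. read_values and free_list are hypothetical names, the 1024-byte line buffer is an arbitrary assumption (a longer line would be split, possibly in the middle of a number), and error handling is kept to a minimum.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct list {              /* same shape as the struct in the question */
    float value;
    struct list *next;
    struct list *prev;
};

/* Hypothetical helper: read comma-separated numbers from `fp` and return
   them as a doubly linked list (NULL if nothing was read). */
struct list *read_values(FILE *fp)
{
    struct list *head = NULL, *tail = NULL;
    char line[1024];       /* assumption: no line is longer than this */

    /* sizeof is fine here: `line` is an array, not a pointer. */
    while (fgets(line, sizeof line, fp) != NULL) {
        for (char *tok = strtok(line, ",\r\n"); tok != NULL;
             tok = strtok(NULL, ",\r\n")) {
            struct list *node = malloc(sizeof *node);
            if (node == NULL)
                return head;           /* out of memory: return what we have */
            node->value = strtof(tok, NULL);
            node->next = NULL;
            node->prev = tail;
            if (tail != NULL)
                tail->next = node;
            else
                head = node;
            tail = node;
        }
    }
    return head;
}

/* Release every node of a list built by read_values. */
void free_list(struct list *head)
{
    while (head != NULL) {
        struct list *next = head->next;
        free(head);
        head = next;
    }
}
```

Call read_values(infile) after the fopen check, walk the list through next, and hand the head to free_list when you're done.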