C - Dynamic Array
I'm trying to feed an array with fscanf() while looping through a file containing a list of integers, n integers long. It seems that I need to use malloc and/or potentially realloc. I've heard that the malloc command takes a noticeable amount of execution time and that it's best to over-allocate. Would someone mind helping me out with understanding the 开发者_Python百科building blocks of accomplishing this end?
Disclaimer: I'm new to C.
No, what you've heard is misleading (at least to me). malloc
is a just a function, and usually a fast one.
- Most of the time it does all of its job in user-space. It "overallocates" so you don't have to
- The bookkeeping (the linked list with free blocks etc.) is highly optimized since virtually everyone uses
malloc
It's unrealistic to think you can easily beat malloc
at this game. I am sorry if this doesn't answer your question (which was pretty general) but you have to realize there is no (spoon) optimization you can easily implement.
Reading the file will be much slower than allocating the memory!
You may want to read the whole file and find out how many entires you want and then malloc() all in one go.
malloc(sizeof(int)*n)
Premature optimization is the root of all evil (google it).
That said, allocate whatever amount you guess is reasonable/typical for the task at hand, and double it whenever you have to realloc. This strategy is rather hard to beat.
For your specific case, malloc isn't going to cause you issues. The run time of fscanf will be many, many times slower than the overhead of malloc and free. But, it can add up in high performance areas of an app. In these areas, there are other ways such as mem pools and fixed size allocators than can combat malloc()'s overhead. But, you are no where near needing to worry about performance overhead when you are just starting out.
Note that malloc()
adds some overhead to each allocation to maintain its internal data structures (at least 4 bytes in common implementations), so if you integers are 4 bytes long, doing a malloc()
for each integer would have >= 50% overhead (probably 75%). This would be the equivalent of using an array of Integer
's in Java, instead of an array of int
's.
As @Charles Dowd said, it's much better to allocate all the memory in one go, to avoid overhead.
You don't want to call malloc
or realloc
with every integer read, that's for sure. Can you estimate how much space you will need? Do you control the file format? If so, you could have the first line of the file be a single integer that denotes how many integers are to be read from the file. Then you could allocate all the space you need in one go. If you don't control the format and can't do this, follow the other suggestion mentioned in this thread: allocate a reasonably-sized buffer, and double it every time you run out of space.
It's a text file (not binary) and not in a fixed format, right? Otherwise it would be easy to calculate the size of the array from the file size ( buffer_size = file_size / record_size
, buffersize is in words (the size of an int), the other sizes is in bytes).
This is what I would do (but I'm a bit of a nutter when it comes to applied statistics).
1) What is the maximum number of characters (a.k.a. bytes) a number (a.k.a. record) will occupy in the file, don't forget to include the end-of-line characters (CR, NF) and other blank glyphs (spaces, tabs et.c.)? If you already can estimate what the average size of a record would be, then it is even better, you use that instead of the maximum size.
initial_buffer_size = file_size / max_record_size + 1 (/ is integer division)
2) Allocate that buffer, read your integers into that buffer until it is full. If the whole file is read then you are finished, otherwise resize or reallocate buffer to meet your new estimated needs.
resize_size =
prev_buffer_size
+ bytes_not_read / ( bytes_already_read / number_of_records_already_read )
+ 1
3) Read into that buffer (from where the previous reading ended) until it is full, or all the of file has been read.
4) If not finished, repeat from step 2) with the new prev_buffer_size
.
This will work best if the numbers (records) are totally randomly distributed from a byte-size point of view. If not, and if you have a clue what kind of distribution they have, you can adjust the algorithm according to that.
精彩评论