Reading a file faster in C
Hmm i wonder whether is a way to read a FILE faster than using fscanf()
For example suppose that i have this text
4
55 k
52 o
24 l
523 i
First i want to read the first number which gives us the number of following lines.
Let this number be called N.
After N, 开发者_运维知识库I want to read N lines which have an integer and a character.
With fscanf
it would be like this
fscanf(fin,"%d %c",&a,&c);
You do almost no processing so probably the bottleneck is the file system throughput. However you should measure first if it really is. If you don't want to use a profiler, you can just measure the running time of your application. The size of input file divided by the running time can be used to check if you've reached the file system throughput limit.
Then if you are far away from aforementioned limit you probably need to optimize the way you read the file. It may be better to read it in larger chunks using fread()
and then process the buffer stored in memory with sscanf()
.
You also can parse the buffer yourself which would be faster than *scanf()
.
[edit]
Especially for Drakosha:
$ time ./main1
Good entries: 10000000
real 0m3.732s
user 0m3.531s
sys 0m0.109s
$ time ./main2
Good entries: 10000000
real 0m0.605s
user 0m0.496s
sys 0m0.094s
So the optimized version makes ~127MB/s which may be my file system's bottleneck or maybe OS caches the file in RAM. The original version is ~20MB/s.
Tested with a 80MB file:
10000000
1234 a
1234 a
...
main1.c
#include <stdio.h>
int ok = 0;
void processEntry(int a, char c) {
if (a == 1234 && c == 'a') {
++ok;
}
}
int main(int argc, char **argv) {
FILE *f = fopen("data.txt", "r");
int total = 0;
int a;
char c;
int i = 0;
fscanf(f, "%d", &total);
for (i = 0; i < total; ++i) {
if (2 != fscanf(f, "%d %c", &a, &c)) {
fclose(f);
return 1;
}
processEntry(a, c);
}
fclose(f);
printf("Good entries: %d\n", ok);
return (ok == total) ? 0 : 1;
}
main2.c
#include <stdio.h>
#include <stdlib.h>
int ok = 0;
void processEntry(int a, char c) {
if (a == 1234 && c == 'a') {
++ok;
}
}
int main(int argc, char **argv) {
FILE *f = fopen("data.txt", "r");
int total = 0;
int a;
char c;
int i = 0;
char *numberPtr = NULL;
char buf[2048];
size_t toProcess = sizeof(buf);
int state = 0;
int fileLength, lengthLeft;
fseek(f, 0, SEEK_END);
fileLength = ftell(f);
fseek(f, 0, SEEK_SET);
fscanf(f, "%d", &total); // read the first line
lengthLeft = fileLength - ftell(f);
// read other lines using FSM
do {
if (lengthLeft < sizeof(buf)) {
fread(buf, lengthLeft, 1, f);
toProcess = lengthLeft;
} else {
fread(buf, sizeof(buf), 1, f);
toProcess = sizeof(buf);
}
lengthLeft -= toProcess;
for (i = 0; i < toProcess; ++i) {
switch (state) {
case 0:
if (isdigit(buf[i])) {
state = 1;
a = buf[i] - '0';
}
break;
case 1:
if (isdigit(buf[i])) {
a = a * 10 + buf[i] - '0';
} else {
state = 2;
}
break;
case 2:
if (isalpha(buf[i])) {
state = 0;
c = buf[i];
processEntry(a, c);
}
break;
}
}
} while (toProcess == sizeof(buf));
fclose(f);
printf("Good entries: %d\n", ok);
return (ok == total) ? 0 : 1;
}
It is unlikely you can significantly speed-up the actual reading of the data. Most of the time here will be spent on transferring the data from disk to memory, which is unavoidable.
You might get a little speed-up by replacing the fscanf
call with fgets
and then manually parsing the string (with strtol
) to bypass the format-string parsing that fscanf
has to do, but don't expect any huge savings.
In the end, it is usually not worth it to heavily optimise I/O operations, because they will typically be dominated by the time it takes to transfer the actual data to/from the hardware/peripherals.
As usual, start with profiling to make sure this part is indeed a bottleneck. Actually, FileSystem cache should make the small reads that you are doing not very expensive, however reading larger parts of the file to memory and then operating on memory might be (a little) faster. In case (which i believe is extremely improbable) is that you need to save every CPU cycle, you might write your own fscanf variant, since you know the format of the string and you only need to support only one variant. But this improvement would bring low gains also, especially on modern CPUs.
The input looks like in various programming contests. In this case - optimize the algorithm, not the reading.
fgets() or fgetc() are faster, as they don't need to drag the whole formatting/variable argument list ballet of fscanf() into the program. Either one of those two functions will leave you with a manual character(s)-to-integer conversion however. Still, the program as whole will be much faster.
Not much hope to read file faster as it is a system call. But there is many ways to parse it faster than scanf with specialised code.
Checkout read
and fread
. As you practice for programming contests, you can ignore all warnings about disk IO buttle neck, cause files can be in memory or pipes from other processes generating tests "on-the-fly".
Put your tests into /dev/shm
(new solution for tmpfs) or make test generator and pipe it.
I've found on programming contests, parsing numbers in manner to atoi
can give much performance boost over scanf/fscanf (atoi
might be not present, so be prepared to implement it by hand - it's easy).
精彩评论