Reading a file faster in C

2023-02-08 04:33 Q&A author:

Hmm, I wonder whether there is a way to read a FILE faster than using fscanf().

For example, suppose that I have this text

First, I want to read the first number, which gives the number of lines that follow.

Let this number be called N.

After N, I want to read N lines, each of which has an integer and a character. With fscanf it would be like this:

fscanf(fin,"%d %c",&a,&c);

You do almost no processing, so the bottleneck is probably the file system throughput. However, you should first measure whether it really is. If you don't want to use a profiler, you can just measure the running time of your application. The size of the input file divided by the running time tells you whether you've reached the file system throughput limit.
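As a rough sketch of that measurement, the helpers below time a plain chunked read of the file and turn the byte count and duration into MB/s. The names `throughput_mb_per_s` and `timed_read` and the 64 KiB chunk size are made up for illustration, and `timespec_get` assumes a C11 library. Comparing this raw read rate with the rate of the full program gives a hint about how much time goes into parsing rather than I/O.

```c
#include <stdio.h>
#include <time.h>

// Megabytes (10^6 bytes) per second for a byte count and a duration.
double throughput_mb_per_s(long bytes, double seconds) {
    return (double)bytes / seconds / 1e6;
}

// Read the whole file in 64 KiB chunks without parsing anything.
// Returns the number of bytes read, or -1 if the file cannot be opened.
// The elapsed wall-clock time is stored in *seconds.
long timed_read(const char *path, double *seconds) {
    char buf[65536];
    struct timespec t0, t1;
    long total = 0;
    size_t n;
    FILE *f = fopen(path, "rb");

    if (f == NULL) {
        return -1;
    }
    timespec_get(&t0, TIME_UTC);
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
        total += (long)n;
    }
    timespec_get(&t1, TIME_UTC);
    fclose(f);
    *seconds = (double)(t1.tv_sec - t0.tv_sec)
             + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return total;
}
```

A caller would do something like `bytes = timed_read("data.txt", &s);` followed by `throughput_mb_per_s(bytes, s)`. A second run over the same file is likely to be served from the OS cache, so the first and later runs can differ a lot.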

Then, if you are far away from the aforementioned limit, you probably need to optimize the way you read the file. It may be better to read it in larger chunks using fread() and then process the buffer stored in memory with sscanf().
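A minimal sketch of the fread() plus sscanf() idea, assuming the whole file fits in memory. The helper names `slurp` and `count_entries` are hypothetical; `%n` is the standard conversion that reports how many characters a call consumed, which is what lets the loop advance through the buffer.

```c
#include <stdio.h>
#include <stdlib.h>

// Read an entire file into a NUL-terminated heap buffer.
// Returns NULL on failure; the caller frees the result.
char *slurp(const char *path) {
    const size_t chunk = 65536;
    FILE *f = fopen(path, "rb");
    char *data = NULL;
    size_t used = 0;
    size_t n;

    if (f == NULL) {
        return NULL;
    }
    for (;;) {
        // make room for one more chunk plus the terminating NUL
        char *grown = realloc(data, used + chunk + 1);
        if (grown == NULL) {
            free(data);
            fclose(f);
            return NULL;
        }
        data = grown;
        n = fread(data + used, 1, chunk, f);
        used += n;
        if (n < chunk) {
            break;  // short read: end of file (or an error)
        }
    }
    fclose(f);
    data[used] = '\0';
    return data;
}

// Walk a buffer holding N followed by "<int> <char>" entries.
// Returns the number of entries parsed, or -1 if N itself is missing.
int count_entries(const char *text) {
    int total, a, used, parsed = 0;
    char c;

    if (sscanf(text, "%d%n", &total, &used) != 1) {
        return -1;
    }
    text += used;
    while (parsed < total && sscanf(text, "%d %c%n", &a, &c, &used) == 2) {
        text += used;
        ++parsed;
    }
    return parsed;
}
```

One caveat worth measuring: some C library implementations scan to the end of the string on every sscanf() call, which can make this loop slow on a very large buffer. In that case, calling sscanf() on one line at a time, or parsing by hand, may work better.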

You can also parse the buffer yourself, which would be faster than *scanf().

[edit]

Especially for Drakosha:

$ time ./main1
Good entries: 10000000

real    0m3.732s
user    0m3.531s
sys     0m0.109s
$ time ./main2
Good entries: 10000000

real    0m0.605s
user    0m0.496s
sys     0m0.094s

So the optimized version reaches ~127MB/s, which may be my file system's bottleneck, or maybe the OS caches the file in RAM. The original version reaches ~20MB/s.

Tested with an 80MB file:

main1.c

#include <stdio.h>

int ok = 0;
void processEntry(int a, char c) {
    if (a == 1234 && c == 'a') {
        ++ok;
    }
}

int main(int argc, char **argv) {
    FILE *f = fopen("data.txt", "r");
    int total = 0;
    int a;
    char c;
    int i = 0;

    if (f == NULL) {
        return 1;
    }
    if (1 != fscanf(f, "%d", &total)) {  // first line: number of entries
        fclose(f);
        return 1;
    }
    for (i = 0; i < total; ++i) {
        // each entry is an integer followed by a character
        if (2 != fscanf(f, "%d %c", &a, &c)) {
            fclose(f);
            return 1;
        }
        processEntry(a, c);
    }
    fclose(f);
    printf("Good entries: %d\n", ok);
    return (ok == total) ? 0 : 1;
}

main2.c

#include <ctype.h>
#include <stdio.h>

int ok = 0;
void processEntry(int a, char c) {
    if (a == 1234 && c == 'a') {
        ++ok;
    }
}

int main(int argc, char **argv) {
    FILE *f = fopen("data.txt", "r");
    int total = 0;
    int a = 0;
    char c;
    size_t i;
    char buf[2048];
    size_t toProcess;
    int state = 0;  // 0: before a number, 1: inside a number, 2: waiting for the letter

    if (f == NULL) {
        return 1;
    }
    if (1 != fscanf(f, "%d", &total)) {  // read the first line
        fclose(f);
        return 1;
    }

    // read other lines using FSM; fread returns the number of bytes it
    // actually read, so the file length is not needed up front
    while ((toProcess = fread(buf, 1, sizeof(buf), f)) > 0) {
        for (i = 0; i < toProcess; ++i) {
            // isdigit/isalpha require a value in the unsigned char range
            unsigned char ch = (unsigned char)buf[i];
            switch (state) {
                case 0:
                    if (isdigit(ch)) {
                        state = 1;
                        a = ch - '0';
                    }
                    break;
                case 1:
                    if (isdigit(ch)) {
                        a = a * 10 + (ch - '0');
                    } else {
                        state = 2;
                    }
                    break;
                case 2:
                    if (isalpha(ch)) {
                        state = 0;
                        c = (char)ch;
                        processEntry(a, c);
                    }
                    break;
            }
        }
    }

    fclose(f);
    printf("Good entries: %d\n", ok);
    return (ok == total) ? 0 : 1;
}

It is unlikely you can significantly speed up the actual reading of the data. Most of the time here will be spent on transferring the data from disk to memory, which is unavoidable.

You might get a little speed-up by replacing the fscanf call with fgets and then manually parsing the string (with strtol) to bypass the format-string parsing that fscanf has to do, but don't expect any huge savings.
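A minimal sketch of the fgets plus strtol approach, assuming each line has the form "<int> <char>" and fits in a small fixed buffer. The helper names `parse_line` and `count_lines` and the 64-byte line limit are assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

// Parse one "<int> <char>" line.
// Returns 1 on success, 0 if the line is malformed.
int parse_line(const char *line, int *a, char *c) {
    char *end;
    long value = strtol(line, &end, 10);

    if (end == line) {
        return 0;  // no digits found
    }
    while (*end == ' ' || *end == '\t') {
        ++end;  // skip the separator
    }
    if (*end == '\0' || *end == '\n') {
        return 0;  // the character is missing
    }
    *a = (int)value;
    *c = *end;
    return 1;
}

// Read lines with fgets and count those that parse.
int count_lines(FILE *f) {
    char line[64];
    int a;
    int n = 0;
    char c;

    while (fgets(line, sizeof(line), f) != NULL) {
        n += parse_line(line, &a, &c);
    }
    return n;
}
```

For example, `parse_line("1234 a\n", &a, &c)` stores 1234 in `a` and 'a' in `c`. strtol does not detect a value that overflows int here; a stricter version would check the range of `value` and errno.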

In the end, it is usually not worth it to heavily optimise I/O operations, because they will typically be dominated by the time it takes to transfer the actual data to/from the hardware/peripherals.

As usual, start with profiling to make sure this part is indeed a bottleneck. Actually, the file system cache should make the small reads that you are doing not very expensive; however, reading larger parts of the file into memory and then operating on memory might be (a little) faster. In the case (which I believe is extremely improbable) that you need to save every CPU cycle, you might write your own fscanf variant, since you know the format of the string and need to support only that one variant. But this improvement would also bring low gains, especially on modern CPUs.

The input looks like that of various programming contests. In this case, optimize the algorithm, not the reading.

fgets() or fgetc() are faster, as they don't need to drag the whole formatting/variable argument list ballet of fscanf() into the program. Either of those two functions will leave you with a manual character(s)-to-integer conversion, however. Still, the program as a whole will be much faster.
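As a sketch of the character-at-a-time route, the hypothetical helper below reads one non-negative integer with getc (the counterpart of fgetc that may be implemented as a macro) and does the digit-to-integer conversion by hand. It assumes no sign and no overflow checking are needed.

```c
#include <ctype.h>
#include <stdio.h>

// Read the next non-negative decimal integer from f, skipping anything
// that is not a digit first. Returns 1 on success, 0 at end of file.
int read_uint(FILE *f, int *out) {
    int ch = getc(f);
    int value = 0;

    while (ch != EOF && !isdigit(ch)) {
        ch = getc(f);
    }
    if (ch == EOF) {
        return 0;
    }
    while (ch != EOF && isdigit(ch)) {
        value = value * 10 + (ch - '0');
        ch = getc(f);
    }
    if (ch != EOF) {
        ungetc(ch, f);  // put back the character that ended the number
    }
    *out = value;
    return 1;
}
```

After `read_uint` returns, the stream is positioned at the separator, so the character of the entry can be fetched with a couple more getc calls.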

There is not much hope of reading the file faster, as it is a system call. But there are many ways to parse it faster than scanf with specialised code.

Check out read and fread. As you practice for programming contests, you can ignore all warnings about the disk I/O bottleneck, because files can be in memory or be pipes from other processes generating tests on the fly.
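A minimal sketch of reading with the POSIX read call, which also works when the input is a pipe. The helper name `read_fully` is made up for illustration, and interrupted calls (EINTR) are simply treated as errors here.

```c
#include <sys/types.h>
#include <unistd.h>

// Fill buf with up to cap bytes from fd. The loop is needed because
// read() may return fewer bytes than requested, which is common with
// pipes. Returns the number of bytes stored, or -1 on a read error.
long read_fully(int fd, char *buf, size_t cap) {
    size_t used = 0;

    while (used < cap) {
        ssize_t n = read(fd, buf + used, cap - used);
        if (n < 0) {
            return -1;
        }
        if (n == 0) {
            break;  // end of input
        }
        used += (size_t)n;
    }
    return (long)used;
}
```

With contest-style input on standard input, the call would be `read_fully(STDIN_FILENO, buf, sizeof(buf))`, and the buffer can then be parsed in memory.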

Put your tests into /dev/shm (a tmpfs mount on Linux) or make a test generator and pipe its output in.

I've found that in programming contests, parsing numbers in the manner of atoi can give a big performance boost over scanf/fscanf (atoi might not be present, so be prepared to implement it by hand; it's easy).
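One way such a hand-written conversion might look, as a sketch: the hypothetical `parse_int` below works on an in-memory buffer, handles an optional sign like atoi does, and advances the caller's pointer so the next field can be read right after it. It does no overflow checking.

```c
// Parse an optionally signed decimal integer at *p, in the manner of
// atoi, and advance *p past it. Leading whitespace is skipped.
int parse_int(const char **p) {
    const char *s = *p;
    int sign = 1;
    int value = 0;

    while (*s == ' ' || *s == '\t' || *s == '\n' || *s == '\r') {
        ++s;
    }
    if (*s == '-' || *s == '+') {
        if (*s == '-') {
            sign = -1;
        }
        ++s;
    }
    while (*s >= '0' && *s <= '9') {
        value = value * 10 + (*s - '0');
        ++s;
    }
    *p = s;  // tell the caller where the number ended
    return sign * value;
}
```

Given `const char *p = "42 a";`, a call to `parse_int(&p)` returns 42 and leaves `p` pointing at the space before the character.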
