
Faster way to create tab-delimited text files?

Many of my programs output huge volumes of data for me to review in Excel. The best way to view all these files is a tab-delimited text format. Currently I use this chunk of code to get it done:

ofstream output (fileName.c_str());
for (int j = 0; j < dim; j++)
{
    for (int i = 0; i < dim; i++)
        output << arrayPointer[j * dim + i] << " ";
    output << endl;
}

This seems to be a very slow operation. Is there a more efficient way of writing text files like this to the hard drive?

Update:

Taking the two suggestions into account, the new code is this:

ofstream output (fileName.c_str());
for (int j = 0; j < dim; j++)
{
    for (int i = 0; i < dim; i++)
        output << arrayPointer[j * dim + i] << "\t";
    output << "\n";
}
output.close();

This writes to the HD at 500 KB/s.

But this writes to the HD at 50 MB/s:

{
    std::ofstream output;
    output.open(fileName.c_str(), std::ios::binary | std::ios::out);
    output.write(reinterpret_cast<char*>(arrayPointer), std::streamsize(dim * dim * sizeof(double)));
    output.close();
}
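For reference, here is a minimal sketch of reading that binary dump back in (assuming the same fileName, dim, and double layout used when writing; note that Excel cannot open this binary format directly):

#include <fstream>
#include <vector>

std::vector<double> data(dim * dim);            // holds the dim x dim matrix
std::ifstream input(fileName.c_str(), std::ios::binary | std::ios::in);
input.read(reinterpret_cast<char*>(&data[0]),   // read the raw doubles back
           std::streamsize(dim * dim * sizeof(double)));
input.close();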


Use C IO; it's a lot faster than C++ IO. I've heard of people in programming contests timing out purely because they used C++ IO and not C IO.

#include <cstdio>

FILE* fout = fopen(fileName.c_str(), "w");

for (int j = 0; j < dim; j++) 
{ 
    for (int i = 0; i < dim; i++) 
        fprintf(fout, "%d\t", arrayPointer[j * dim + i]); 
    fprintf(fout, "\n");
} 
fclose(fout);

Just change %d to be the correct type.
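For example, if arrayPointer points to doubles (as the sizeof(double) in the update suggests), the inner call would be:

fprintf(fout, "%g\t", arrayPointer[j * dim + i]);  // %g formats a double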


Don't use endl. It flushes the stream buffer, which is potentially very inefficient. Instead:

output << '\n';
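For reference, std::endl is just a newline plus an explicit flush, so you can see exactly what you are paying for:

output << '\n';                // newline only, no flush
output << '\n' << std::flush;  // equivalent to output << std::endl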


I decided to test JPvdMerwe's claim that C stdio is faster than C++ IO streams. (Spoiler: yes, but not necessarily by much.) To do this, I used the following test programs:

Common wrapper code, omitted from programs below:

#include <iostream>
#include <cstdio>
int main (void) {
  // program code goes here
}

Program 1: normal synchronized C++ IO streams

for (int j = 0; j < ROWS; j++) {
  for (int i = 0; i < COLS; i++) {
    std::cout << (i-j) << "\t";
  }
  std::cout << "\n";
}

Program 2: unsynchronized C++ IO streams

Same as program 1, except with std::cout.sync_with_stdio(false); prepended.
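In full, program 2 is:

std::cout.sync_with_stdio(false);  // decouple C++ streams from C stdio
for (int j = 0; j < ROWS; j++) {
  for (int i = 0; i < COLS; i++) {
    std::cout << (i-j) << "\t";
  }
  std::cout << "\n";
}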

Program 3: C stdio printf()

for (int j = 0; j < ROWS; j++) {
  for (int i = 0; i < COLS; i++) {
    printf("%d\t", i-j);
  }
  printf("\n");
}

All programs were compiled with GCC 4.8.4 on Ubuntu Linux, using the following command:

g++ -Wall -ansi -pedantic -DROWS=10000 -DCOLS=1000 prog.cpp -o prog

and timed using the command:

time ./prog > /dev/null

Here are the results of the test on my laptop (measured in wall clock time):

  • Program 1 (synchronized C++ IO): 3.350s (= 100%)
  • Program 2 (unsynchronized C++ IO): 3.072s (= 92%)
  • Program 3 (C stdio): 2.592s (= 77%)

I also ran the same test with g++ -O2 to test the effect of optimization, and got the following results:

  • Program 1 (synchronized C++ IO) with -O2: 3.118s (= 100%)
  • Program 2 (unsynchronized C++ IO) with -O2: 2.943s (= 94%)
  • Program 3 (C stdio) with -O2: 2.734s (= 88%)

(The last line is not a fluke; program 3 consistently runs slower for me with -O2 than without it!)

Thus, my conclusion is that, based on this test, C stdio is indeed about 10% to 25% faster for this task than (synchronized) C++ IO. Using unsynchronized C++ IO saves about 5% to 10% over synchronized IO, but is still slower than stdio.


PS. I tried a few other variations, too:

  • Using std::endl instead of "\n" is, as expected, slightly slower, but the difference is less than 5% for the parameter values given above. However, printing more but shorter output lines (e.g. -DROWS=1000000 -DCOLS=10) makes std::endl more than 30% slower than "\n".

  • Piping the output to a normal file instead of /dev/null slows down all the programs by about 0.2s, but makes no qualitative difference to the results.

  • Increasing the line count by a factor of 10 also yields no surprises; the programs all take about 10 times longer to run, as expected.

  • Prepending std::cout.sync_with_stdio(false); to program 3 has no noticeable effect.

  • Using (double)(i-j) (and "%g\t" for printf()) slows down all three programs a lot! Notably, program 3 is still fastest, taking only 9.3s where programs 1 and 2 each took a bit over 14s, a speedup of nearly 40%! (And yes, I checked, the outputs are identical.) Using -O2 makes no significant difference either.


Does it have to be written in C++? If not, there are many tools already written in C, e.g. (g)awk (usable on both Unix and Windows), that do the job of file parsing really well, even on big files. The one-liner below forces awk to rebuild each record ($1=$1) with the output field separator set to a tab; the trailing 1 prints every line:

awk '{$1=$1}1' OFS="\t" file


It may be faster to do it this way, since writing single characters avoids the overhead of streaming one-character C strings:

ofstream output (fileName.c_str());
for (int j = 0; j < dim; j++)
{
    for (int i = 0; i < dim; i++)
        output << arrayPointer[j * dim + i] << '\t';
    output << '\n';
}


Use '\t' instead of " ":

ofstream output (fileName.c_str());
for (int j = 0; j < dim; j++)
{
    for (int i = 0; i < dim; i++)
        output << arrayPointer[j * dim + i] << '\t';
    output << endl;
}

