Parsing text from CMemFile line by line

I have got a huge text file loaded into a CMemFile object and would like to parse it line by line (separated by newline chars).

Originally it is a zip file on disk; I unzip it into memory to parse it, hence the CMemFile.

One working way to read line by line is this (m_file is a smart pointer to a CMemFile):

    CArchive archive(m_file.get(), CArchive::load);

    CString line;

    while(archive.ReadString(line))
    {
        ProcessLine(string(line));
    }

Since it takes a lot of time, I tried to write my own routine:

    const UINT READSIZE = 1024;
    const char NEWLINE = '\n';
    char readBuffer[READSIZE];
    UINT bytesRead = 0;
    char *posNewline = NULL;

    const char* itEnd = readBuffer + READSIZE;
    ULONGLONG currentPosition = 0;
    ULONGLONG newlinePositionInBuffer = 0;

    do
    {
        currentPosition = m_file->GetPosition();

        bytesRead = m_file->Read(&readBuffer, READSIZE);        

        if(bytesRead == 0) break; // EOF

        posNewline = std::find(readBuffer, readBuffer + bytesRead, NEWLINE);

        if(posNewline != itEnd)
        {
            // found newline
            ProcessLine(string(readBuffer, posNewline));
            newlinePositionInBuffer = posNewline - readBuffer + 1; // +1 to skip the newline itself
            m_file->Seek(currentPosition + newlinePositionInBuffer, CFile::begin);
        }
    } while(true);

Measuring the performance showed both methods take about the same time...

Can you think of any performance improvements or a faster way to do the parsing?

Thanks for any advice


A few notes and comments that may be useful:

  • Profiling is the only way to know for sure what the code is doing and how long it takes. Often the bottleneck is not obvious from the code itself. One basic method would be to time the loading, the uncompressing, and the parsing individually (a minimal timing sketch follows this list).
  • The actual loading of the file from disk, and in your case the uncompressing, may take significantly more time than the parsing, especially if your ProcessLine() function is a no-op. If your parsing only takes 1% of the total time, then you're never going to get much from trying to optimize that 1%. This is something profiling your code would tell you.
  • A general way to optimize a load/parse algorithm is to look at how many times a particular byte is read or parsed. The minimum, and possibly fastest, algorithm reads and parses each byte only once. Looking at your algorithms, each byte appears to be copied half a dozen times and potentially parsed a similar number of times. Reducing these numbers may help reduce the overall algorithm time, although the relative gain may not be much overall.
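
For instance, a minimal timing sketch might look like the following; UnzipToMemory() and ParseLines() are hypothetical placeholders for your own unzip and parse steps, and std::chrono is only used to keep the example self-contained:

    #include <chrono>
    #include <iostream>

    // Hypothetical placeholders for the real unzip and parse steps.
    void UnzipToMemory() { /* ... load and decompress into the CMemFile ... */ }
    void ParseLines()    { /* ... read the CMemFile line by line ... */ }

    void TimePhases()
    {
        using clock = std::chrono::steady_clock;

        auto t0 = clock::now();
        UnzipToMemory();
        auto t1 = clock::now();
        ParseLines();
        auto t2 = clock::now();

        auto ms = [](clock::time_point a, clock::time_point b)
        {
            return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
        };
        std::cout << "unzip: " << ms(t0, t1) << " ms, parse: " << ms(t1, t2) << " ms\n";
    }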


Using a profiler showed that 75% of the process time was spent in this line of code:

    ProcessLine(string(readBuffer, posNewline));

Mainly the creation of the temporary string caused a big overhead (many allocations); the ProcessLine function itself contains no code. By changing the declaration from:

    void ProcessLine(const std::string &);

to:

    inline void ProcessLine(const char*, const char*);

the process time was reduced by a factor of five.
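
In other words, the parsing loop can hand the begin/end pointers of each line straight to ProcessLine instead of materializing a temporary std::string per line. A minimal sketch of what that change looks like (the parameter names are only illustrative):

    // New signature: a line is described by a [begin, end) pointer pair,
    // so no temporary std::string has to be allocated for every line.
    inline void ProcessLine(const char* begin, const char* end)
    {
        // ... actual parsing of the characters in [begin, end) ...
        (void)begin;
        (void)end;
    }

    // The call site inside the read loop changes from
    //     ProcessLine(string(readBuffer, posNewline));   // allocates a temporary
    // to
    //     ProcessLine(readBuffer, posNewline);           // no allocation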


You could run the decompression and the parsing in separate threads. Each time the decompression produces some data, pass it to the parsing thread using a message mechanism.

This allows both to run in parallel, and it also results in a smaller memory overhead, since you work on blocks rather than the entire decompressed file (which means fewer page faults and less swapping to virtual memory).
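
A minimal sketch of such a producer/consumer hand-off using the standard library follows; the BlockQueue type is an illustrative helper, and the comments stand in for your own unzip and line-parsing code:

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Simple thread-safe queue acting as the "message mechanism" between
    // the decompression thread (producer) and the parsing thread (consumer).
    struct BlockQueue
    {
        std::mutex m;
        std::condition_variable cv;
        std::queue<std::vector<char>> blocks;
        bool done = false;

        void Push(std::vector<char> block)
        {
            { std::lock_guard<std::mutex> lock(m); blocks.push(std::move(block)); }
            cv.notify_one();
        }
        bool Pop(std::vector<char>& block)
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [this] { return !blocks.empty() || done; });
            if (blocks.empty()) return false;            // finished and drained
            block = std::move(blocks.front());
            blocks.pop();
            return true;
        }
        void Finish()
        {
            { std::lock_guard<std::mutex> lock(m); done = true; }
            cv.notify_all();
        }
    };

    void DecompressAndParse()
    {
        BlockQueue queue;

        std::thread parser([&queue]
        {
            std::vector<char> block;
            while (queue.Pop(block))
            {
                // ... scan 'block' for newlines and call ProcessLine(...) ...
            }
        });

        // Producer side: for each chunk the unzip routine emits, do
        //     queue.Push(std::move(chunk));
        queue.Finish();      // signal that no more blocks will arrive
        parser.join();
    }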


I think your problem might be that you are reading in too much and re-seeking to the next line.

If your file was

   foo
   bar
   etc

Say 10 bytes average on a line. Each Read pulls in many lines' worth of data, but you only process the first one before seeking back, so everything after it gets read again on the next iteration.
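
Since the decompressed data is already sitting in memory, one way to avoid the re-reads entirely is to walk the buffer with a pointer instead of Read/Seek. A minimal sketch, assuming you can get at the raw buffer and its size (for example by keeping the buffer you handed to the CMemFile, or via CMemFile::Detach()):

    #include <cstddef>
    #include <cstring>

    // The pointer-pair ProcessLine from above; the real parsing goes inside.
    inline void ProcessLine(const char*, const char*) { /* ... */ }

    // Walk the decompressed buffer in place: each line is handed to
    // ProcessLine as a [begin, end) pair, so nothing is copied or re-read.
    void ParseBuffer(const char* data, std::size_t size)
    {
        const char* pos = data;
        const char* bufferEnd = data + size;

        while (pos < bufferEnd)
        {
            const char* eol = static_cast<const char*>(
                std::memchr(pos, '\n', bufferEnd - pos));
            const char* lineEnd = eol ? eol : bufferEnd;

            // Strip a trailing '\r' for CRLF-terminated files.
            if (lineEnd > pos && lineEnd[-1] == '\r')
                --lineEnd;

            ProcessLine(pos, lineEnd);

            pos = eol ? eol + 1 : bufferEnd;   // continue after the '\n'
        }
    }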
