Parsing text from CMemFile line by line

I have got a huge text file loaded into a CMemFile object and would like to parse it line by line (separated by newline chars).

Originally it is a zip file on disk; I unzip it into memory to parse it, hence the CMemFile.

One working way to read line by line is this (m_file is a smart pointer to a CMemFile):

    CArchive archive(m_file.get(), CArchive::load);

    CString line;

    while(archive.ReadString(line))
    {
        ProcessLine(string(line));
    }

Since it takes a lot of time, I tried to write my own routine:

    const UINT READSIZE = 1024;
    const char NEWLINE = '\n';
    char readBuffer[READSIZE];
    UINT bytesRead = 0;
    char *posNewline = NULL;

    const char* itEnd = readBuffer + READSIZE;
    ULONGLONG currentPosition = 0;
    ULONGLONG newlinePositionInBuffer = 0;

    do
    {
        currentPosition = m_file->GetPosition();

        bytesRead = m_file->Read(&readBuffer, READSIZE);        

        if(bytesRead == 0) break; // EOF

        posNewline = std::find(readBuffer, readBuffer + bytesRead, NEWLINE);

        if(posNewline != itEnd)
        {
            // found newline
            ProcessLine(string(readBuffer, posNewline));
            newlinePositionInBuffer = posNewline - readBuffer + 1; // +1 to skip the newline itself
            m_file->Seek(currentPosition + newlinePositionInBuffer, CFile::begin);
        }
    } while(true);

Measuring the performance showed both methods take about the same time...

Can you think of any performance improvements or a faster way to do the parsing?

Thanks for any advice


A few notes and comments that may be useful:

  • Profiling is the only way to know for sure what the code is doing and how long it takes. Often the bottleneck is not obvious from the code itself. One basic method would be to time the loading, the uncompressing, and the parsing individually (a minimal timing sketch follows this list).
  • The actual loading of the file from disk, and in your case the uncompressing, may take significantly more time than the parsing, especially if your ProcessLine() function is a no-op. If your parsing only takes 1% of the total time, then you're never going to get much from trying to optimize that 1%. This is something profiling your code would tell you.
  • A general way to optimize a load/parse algorithm is to look at how many times a particular byte is read or parsed. The minimum, and possibly fastest, algorithm reads and parses each byte only once. Looking at your algorithms, each byte appears to be copied half a dozen times and potentially parsed a similar number of times. Reducing these numbers may help reduce the overall algorithm time, although the relative gain may not be much overall.
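
For instance, a minimal timing sketch might look like the following; UnzipToMemory() and ParseLines() are hypothetical placeholders for your own unzip and parse steps, and std::chrono is only used to keep the example self-contained:

    #include <chrono>
    #include <iostream>

    // Hypothetical placeholders for the real unzip and parse steps.
    void UnzipToMemory() { /* ... load and decompress into the CMemFile ... */ }
    void ParseLines()    { /* ... read the CMemFile line by line ... */ }

    void TimePhases()
    {
        using clock = std::chrono::steady_clock;

        auto t0 = clock::now();
        UnzipToMemory();
        auto t1 = clock::now();
        ParseLines();
        auto t2 = clock::now();

        auto ms = [](clock::time_point a, clock::time_point b)
        {
            return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
        };
        std::cout << "unzip: " << ms(t0, t1) << " ms, parse: " << ms(t1, t2) << " ms\n";
    }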


Using a profiler showed that 75% of the process time was spent in this line of code:

    ProcessLine(string(readBuffer, posNewline));

Mainly the creation of the temporary string caused a big overhead (many allocations); the ProcessLine function itself contains no code. By changing the declaration from:

    void ProcessLine(const std::string &);

to:

    inline void ProcessLine(const char*, const char*);

the process time was reduced by a factor of five.
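
In other words, the parsing loop can hand the begin/end pointers of each line straight to ProcessLine instead of materializing a temporary std::string per line. A minimal sketch of what that change looks like (the parameter names are only illustrative):

    // New signature: a line is described by a [begin, end) pointer pair,
    // so no temporary std::string has to be allocated for every line.
    inline void ProcessLine(const char* begin, const char* end)
    {
        // ... actual parsing of the characters in [begin, end) ...
        (void)begin;
        (void)end;
    }

    // The call site inside the read loop changes from
    //     ProcessLine(string(readBuffer, posNewline));   // allocates a temporary
    // to
    //     ProcessLine(readBuffer, posNewline);           // no allocation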


You could run the decompression and the parsing in separate threads. Each time the decompression produces some data, pass it to the parsing thread using a message mechanism.

This allows both to run in parallel, and it also results in a smaller memory overhead, since you work on blocks rather than the entire decompressed file (which means fewer page faults and less swapping to virtual memory).
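
A minimal sketch of such a producer/consumer hand-off using the standard library follows; the BlockQueue type is an illustrative helper, and the comments stand in for your own unzip and line-parsing code:

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Simple thread-safe queue acting as the "message mechanism" between
    // the decompression thread (producer) and the parsing thread (consumer).
    struct BlockQueue
    {
        std::mutex m;
        std::condition_variable cv;
        std::queue<std::vector<char>> blocks;
        bool done = false;

        void Push(std::vector<char> block)
        {
            { std::lock_guard<std::mutex> lock(m); blocks.push(std::move(block)); }
            cv.notify_one();
        }
        bool Pop(std::vector<char>& block)
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [this] { return !blocks.empty() || done; });
            if (blocks.empty()) return false;            // finished and drained
            block = std::move(blocks.front());
            blocks.pop();
            return true;
        }
        void Finish()
        {
            { std::lock_guard<std::mutex> lock(m); done = true; }
            cv.notify_all();
        }
    };

    void DecompressAndParse()
    {
        BlockQueue queue;

        std::thread parser([&queue]
        {
            std::vector<char> block;
            while (queue.Pop(block))
            {
                // ... scan 'block' for newlines and call ProcessLine(...) ...
            }
        });

        // Producer side: for each chunk the unzip routine emits, do
        //     queue.Push(std::move(chunk));
        queue.Finish();      // signal that no more blocks will arrive
        parser.join();
    }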


I think your problem might be that you are reading in too much and re-seeking to the next line.

If your file was

   foo
   bar
   etc

Say 10 bytes average on a line. Each Read pulls in many lines' worth of data, but you only process the first one before seeking back, so everything after it gets read again on the next iteration.
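
Since the decompressed data is already sitting in memory, one way to avoid the re-reads entirely is to walk the buffer with a pointer instead of Read/Seek. A minimal sketch, assuming you can get at the raw buffer and its size (for example by keeping the buffer you handed to the CMemFile, or via CMemFile::Detach()):

    #include <cstddef>
    #include <cstring>

    // The pointer-pair ProcessLine from above; the real parsing goes inside.
    inline void ProcessLine(const char*, const char*) { /* ... */ }

    // Walk the decompressed buffer in place: each line is handed to
    // ProcessLine as a [begin, end) pair, so nothing is copied or re-read.
    void ParseBuffer(const char* data, std::size_t size)
    {
        const char* pos = data;
        const char* bufferEnd = data + size;

        while (pos < bufferEnd)
        {
            const char* eol = static_cast<const char*>(
                std::memchr(pos, '\n', bufferEnd - pos));
            const char* lineEnd = eol ? eol : bufferEnd;

            // Strip a trailing '\r' for CRLF-terminated files.
            if (lineEnd > pos && lineEnd[-1] == '\r')
                --lineEnd;

            ProcessLine(pos, lineEnd);

            pos = eol ? eol + 1 : bufferEnd;   // continue after the '\n'
        }
    }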
