Reading a file in C++

I am writing an application that monitors a file and matches some pattern in that file. I want to know the fastest way to read a file in C++. Is reading line by line faster, or is reading the file in chunks faster?


Your question has more to do with the performance of hardware, operating systems, and runtime libraries than with programming languages. When you start reading a file, the OS is probably loading the file in chunks anyway, since that is how the file is stored on disk. It makes sense for the OS to load each chunk entirely on first access and cache it, rather than reading the chunk, extracting the requested data, and discarding the rest.

Which is faster, line by line or a chunk at a time? As always with these things, the answer is not something you can predict. The only way to know for sure is to write a line-by-line version and a chunk-at-a-time version and profile them (measure how long each version takes).
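
A rough way to make that comparison is sketched below. This is only an illustration, not a rigorous benchmark: the helper names (count_lines_getline, count_newlines_chunked, time_ms) and the 1 MB default chunk size are assumptions made for the example, and counting newlines stands in for whatever work the real program does.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: counts lines by reading one line at a time.
std::size_t count_lines_getline(const std::string& path) {
    std::ifstream in(path);
    std::string line;
    std::size_t n = 0;
    while (std::getline(in, line)) ++n;
    return n;
}

// Hypothetical helper: counts newline characters by reading fixed-size chunks.
// The 1 MB default is an assumption; tune it by measuring.
std::size_t count_newlines_chunked(const std::string& path,
                                   std::size_t chunk_size = 1 << 20) {
    assert(chunk_size > 0);
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(chunk_size);
    std::size_t n = 0;
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize got = in.gcount();  // may be short on the last chunk
        n += static_cast<std::size_t>(
            std::count(buf.begin(), buf.begin() + got, '\n'));
    }
    return n;
}

// Runs f once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_ms([&] { count_lines_getline(path); }) and time_ms([&] { count_newlines_chunked(path); }) on the same file gives two numbers to compare. Run each version several times, because the first read of a file is usually slower than later reads that hit the OS cache.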


In general, reading large amounts of a file into a buffer and then parsing the buffer is a lot faster than reading individual lines. The actual proof is to profile code that reads line by line, then profile code that reads into large buffers, and compare the profiles.

The foundation for this justification is:

  • Reduction of I/O Transactions
  • Keeping the Hard Drive Spinning
  • Parsing Memory Is Faster

I improved the performance of one application from 65 minutes down to 2 minutes by applying these techniques.

Reduction of I/O Transactions
Reducing the number of I/O transactions results in fewer calls to the operating system, which reduces the time spent there. It also reduces the number of branches in your code, which improves the performance of the instruction pipeline in your processor, and it reduces traffic to the hard drive. The hard drive has fewer commands to process, so it has less overhead.

Keeping the Hard Drive Spinning
To access a file, the hard drive has to ramp up the motors to a decent speed (which takes time), position the head to the desired track and sector, and read the data. Positioning the head and ramping up the motor is overhead time required by all transactions. The overhead in reading the data is very little. The objective is to read as much data as possible in one transaction because this is where the hard drive is most efficient. Reducing the number of transactions will reduce the wait times for ramping up the motors and positioning the heads.

Although modern computers have caches for both data and commands, reducing the number of transactions will still speed things up. Larger "payloads" allow more efficient use of those caches and avoid the overhead of sorting the requests.

Parsing Memory Is Faster
Reading from memory is always faster than reading from an external source. Reading a second line of text from a buffer requires incrementing a pointer. Reading a second line from a file requires an I/O transaction to get the data into memory. If your program has memory to spare, haul the data into memory, then search the memory.
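
A minimal sketch of "haul the data into memory, then search the memory" follows. It assumes the file fits comfortably in RAM; the names load_file and count_matching_lines are made up for this example, and the "pattern" is treated as a plain substring rather than a regular expression.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: reads the whole file into one in-memory buffer.
// Returns an empty string if the file cannot be opened or sized.
std::string load_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);  // open at end
    if (!in) return {};
    const std::streamoff size = in.tellg();  // position at end == file size
    if (size < 0) return {};
    std::string data(static_cast<std::size_t>(size), '\0');
    in.seekg(0);
    in.read(&data[0], static_cast<std::streamsize>(size));
    data.resize(static_cast<std::size_t>(in.gcount()));
    return data;
}

// Hypothetical helper: counts the lines in `data` that contain `pattern`.
// Moving to the next line only advances an index; no I/O is involved.
std::size_t count_matching_lines(const std::string& data,
                                 const std::string& pattern) {
    assert(!pattern.empty());
    std::size_t matches = 0;
    std::size_t pos = 0;
    while (pos < data.size()) {
        std::size_t end = data.find('\n', pos);
        if (end == std::string::npos) end = data.size();  // last line, no newline
        const auto first = data.begin() + static_cast<std::ptrdiff_t>(pos);
        const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
        if (std::search(first, last, pattern.begin(), pattern.end()) != last)
            ++matches;
        pos = end + 1;  // step past the newline to the start of the next line
    }
    return matches;
}
```

Something like count_matching_lines(load_file("app.log"), "error") would then do one bulk read followed by a pure in-memory scan ("app.log" is a placeholder file name).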

Too Much Data Negates The Performance Savings
There is a finite amount of RAM on the computer for applications to share. Using more memory than is physically available may cause the computer to "page", that is, forward the request to the hard drive (also known as virtual memory). In this case, there may be little savings because the hard drive is accessed anyway (by the operating system, without your program's knowledge). Profiling will give you a good indication of the optimum size of the data buffer.

The application I optimized was reading one byte at a time from a 2 GB file. The performance greatly improved when I changed the program to read 1 MB chunks of data. This also allowed for additional performance gains through loop unrolling.
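
One detail to watch when switching to chunked reads for pattern matching is that a match can straddle two chunks. The sketch below is not the code from the application described above; it is an assumed illustration that counts occurrences of a plain substring while carrying the last pattern.size() - 1 bytes of each chunk over to the next one. The function name count_occurrences_chunked is hypothetical, and the 1 MB default simply mirrors the figure mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: counts (possibly overlapping) occurrences of `pattern`
// in the file at `path`, reading `chunk_size` bytes per I/O call.
std::size_t count_occurrences_chunked(const std::string& path,
                                      const std::string& pattern,
                                      std::size_t chunk_size = 1 << 20) {
    assert(!pattern.empty() && chunk_size > 0);
    std::ifstream in(path, std::ios::binary);
    std::string chunk(chunk_size, '\0');  // reused read buffer
    std::string window;                   // carried-over tail + current chunk
    std::size_t total = 0;
    while (in) {
        in.read(&chunk[0], static_cast<std::streamsize>(chunk.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;
        window.append(chunk, 0, got);
        for (std::size_t pos = window.find(pattern); pos != std::string::npos;
             pos = window.find(pattern, pos + 1)) {
            ++total;
        }
        // Keep only the last pattern.size() - 1 bytes. That tail is too short
        // to hold a full match, so nothing is counted twice, but it lets a
        // match that spans the chunk boundary be found on the next pass.
        const std::size_t keep = std::min(window.size(), pattern.size() - 1);
        window.erase(0, window.size() - keep);
    }
    return total;
}
```

The right chunk size depends on the machine and the file, so it is worth timing a few values rather than assuming 1 MB is best.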

Hope this helps.


You could try to map the file directly into memory using a memory-mapped file, and then use standard C++ logic to find the patterns that you want.
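
As a rough sketch of that idea, assuming a POSIX system (open, fstat, mmap, munmap; Windows has a different API for file mappings) and a plain substring as the pattern, the mapped bytes can be handed straight to std::search. The function name mapped_file_contains is made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close

// Hypothetical helper: returns true if `pattern` occurs anywhere in the file.
// Returns false if there is no match, the file is empty, or mapping fails.
bool mapped_file_contains(const std::string& path, const std::string& pattern) {
    assert(!pattern.empty());
    const int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return false;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {  // zero-length maps are invalid
        close(fd);
        return false;
    }
    const std::size_t size = static_cast<std::size_t>(st.st_size);

    void* addr = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (addr == MAP_FAILED) return false;

    // The file contents now look like an ordinary read-only char array.
    const char* first = static_cast<const char*>(addr);
    const char* last = first + size;
    const bool found =
        std::search(first, last, pattern.begin(), pattern.end()) != last;

    munmap(addr, size);
    return found;
}
```

With a mapping, the OS pages the file in on demand, so the program does not manage a read buffer itself. Whether this beats chunked reads for a given workload is something to measure rather than assume.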


The OS (or even the C++ class you use) probably reads the file in chunks and caches it even if you read it line by line, in order to minimize disk access (from the operating system's point of view, it is faster to read data from a memory buffer than from a hard disk).

Note that a good way to improve the performance of your program (if it is really time critical) is to minimize the number of calls to operating system functions (which manage its resources).
