开发者

Why is sequentially reading a large file row by row with mmap and madvise sequential slower than fgets?

Overview

I have a program bounded significantly by IO and am trying to speed it up. Using mmap seemed to be a good idea, but it actually degrades the performance relative to just using a series of fgets calls.

Some demo code

I've squeezed down demos to just the essentials, testing against an 800mb file with about 3.5million lines:

With fgets:

char buf[4096];
FILE * fp = fopen(argv[1], "r");

while(fgets(buf, 4096, fp) != 0) {
    // do stuff
}
fclose(fp);
return 0;

Runtime for 800mb file:

[juhani@xtest tests]$ time ./readfile /r/40/13479/14960 

real    0m25.614s
user    0m0.192s
sys 0m0.124s

The mmap version:

struct stat finfo;
int fh, len;
char * mem;
char * row, *end;
if(stat(argv[1], &finfo) == -1) return 0;
if((fh = open(argv[1], O_RDONLY)) == -1) return 0;

mem = (char*)mmap(NULL, finfo.st_size, PROT_READ, MAP_SHARED, fh, 0);
if(mem == (char*)-1) return 0;
madvise(mem, finfo.st_size, POSIX_MADV_SEQUENTIAL);
row = mem;
while((end = strchr(row, '\n')) != 0) {
    // do stuff
    row = end + 1;
}
munmap(mem, finfo.st_size);
close(fh);

Runtime varies quite a bit, but never faster than fgets:

[juhani@xtest tests]$ time ./readfile_map /r/40/13479/14960

real    0m28.891s
user    0m0.252s
sys 0m0.732s
[juhani@xtest tests]$ time ./readfile_map /r/40/13479/14960

real    0m42.605s
user    0m0.144s
sys 0m0.472s

Other notes

  • Watching the process run in top, the memmapped version generated a few thousand page faults along the way.
  • CPU and memory usage are both very low for the fgets version.

Questions

  • Why is this the case? Is it just because the buffered file access implemented by fopen/fgets is better than the aggressive prefetching that mmap with madvis开发者_C百科e POSIX_MADV_SEQUENTIAL?
  • Is there an alternative method of possibly making this faster(Other than on-the-fly compression/decompression to shift IO load to the processor)? Looking at the runtime of 'wc -l' on the same file, I'm guessing this might not be the case.


POSIX_MADV_SEQUENTIAL is only a hint to the system and may be completely ignored by a particular POSIX implementation.

The difference between your two solutions is that mmap requires the file to be mapped into the virtual address space entierly, whereas fgets has the IO entirely done in kernel space and just copies the pages into a buffer that doesn't change.

This also has more potential for overlap, since the IO is done by some kernel thread.

You could perhaps increase the perceived performance of the mmap implementation by having one (or more) independent threads reading the first byte of each page. This (or these) thread then would have all the page faults and the time your application thread would come at a particular page it would already be loaded.


Reading the man pages of mmap reveals that the page faults could be prevented by adding MAP_POPULATE to mmap's flags:

MAP_POPULATE (since Linux 2.5.46): Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.

This way a page faulting pre-load thread (as suggested by Jens) will become obsolete.

Edit: First of all the benchmarks you make should be done with the page cache flushed to get meaningful results:

    echo 3 | sudo tee /proc/sys/vm/drop_caches

Additionally: The MADV_WILLNEED advice with madvise will pre-fault the required pages in (same as the POSIX_FADV_WILLNEED with fadvise). Currently unfortunately these calls block until the requested pages are faulted in, even if the documentation tells differently. But there are kernel patches underway which queue the pre-fault requests into a kernel work-queue to make these calls asynchronous as one would expect - making a separate read-ahead user space thread obsolete.


What you're doing - reading through the entire mmap space - is supposed to trigger a series of page faults. with mmap, the OS only lazily loads pages of the mmap'd data into memory (loads them when you access them). So this approach is an optimization. Although you interface with mmap as if the entire thing is in RAM, it is not all in RAM - it is just a chunk set aside in virtual memory.

In contrast, when you do a read of a file into a buffer the OS pulls the entire structure into RAM (into your buffer). This can apply alot of memory pressure, crowding out other pages, forcing them to be written back to disk. It can lead to thrashing if you're low on memory.

A common optimization technique when using mmap is to page-walk the data into memory: loop through the mmap space, incrementing your pointer by the page size, accessing a single byte per page and triggering the OS to pull all the mmap's pages into memory; triggering all these page faults. This is an optimization technique to "prime the RAM", pulling the mmap in and readying it for future use. With this approach, the OS won't need to do as much lazy loading. You can do this on a separate thread to lead the pages in prior to your main threads access - just make sure you don't run out of RAM or get too far ahead of the main thread, you'll actually begin to degrade performance.

What is the difference between page walking w/ mmap and read() into a large buffer? That's kind of complicated.

Older versions of UNIX, and some current versions, don't always use demand-paging (where the memory is divided up into chunks and swapped in / out as needed). Instead, in some cases, the OS uses traditional swapping - it treats data structures in memory as monolithic, and the entire structure is swapped in / out as needed. This may be more efficient when dealing with large files, where demand-paging requires copying pages into the universal buffer cache, and may lead to frequent swapping or even thrashing. Swapping may avoid use of the universal buffer cache - reducing memory consumption, avoiding an extra copy operation and avoiding frequent writes. Downside is you can't benefit from demand-paging. With mmap, you're guaranteed demand-paging; with read() you are not.

Also bear in mind that page-walking in a full mmap memory space is always about 60% slower than a flat out read (not counting if you use MADV_SEQUENTIAL or other optimizations).

One note when using mmap w/ MADV_SEQUENTIAL - when you use this, you must be absolutely sure your data IS stored sequentially, otherwise this will actually slow down the paging in of the file by about 10x. Usually your data is not mapped to a continuous section of the disk, it's written to blocks that are spread around the disk. So I suggest you be careful and look closely into this.

Remember, too much data in RAM will pollute the RAM, making page faults alot more common elsewhere. One common misconception about performance is that CPU optimization is more important than memory footprint. Not true - the time it takes to travel to disk exceeds the time of CPU operations by something like 8 orders of magnitude, even with todays SSDs. Therefor, when program execution speed is a concern, memory footprint and utilization is far more important.

A nice thing about read() is the data can be stored on the stack (assuming the stack is large enough), which will further speed up processing.

Using read() with a streaming approach is a good alternative to mmap, if it fits your use case. This is kind of what you're doing with fgets/fputs (fgets/fputs is internally implemented with read). Here what you do is, in a loop, read into a buffer, process the data, & then read in the next section / overwrite the old data. Streaming like this can keep your memory consumption very low, and can be the most efficient way of doing I/O. The only downside is that you never have the entire file in memory at once, and it doesn't persist in memory. So it's a one-off approach. If you can use it - great, do it. If not... use mmap.

So whether read or mmap is faster... it depends on many factors. Testing is probably what you need to do. Generally speaking, mmap is nice if you plan on using the data for an extended period, where you will benefit from demand-paging; or if you just can't handle that amount of data in memory at once. Read() is better if you are using a streaming approach - the data doesn't have to persist, or the data can fit in memory so memory pressure isn't a concern. Also if the data won't be in memory for very long, read() may be preferable.

Now, with your current implementation - which is a sort of streaming approach - you are using fgets() and stopping on \n. Large, bulk reads are more efficient than calling read() repeatedly a million times (which is what fgets does). You don't have to use a giant buffer - you don't want excess memory pressure (which can pollute your cache & other things), & the system also has some internal buffering it uses. But you do want to be reading into a buffer of... lets say 64k in size. You definitely dont want to be calling read line by line.

You could multithread the parsing of that buffer. Just make sure the threads access data in different cache blocks - so find the size of the cache block, get your threads working on different portions of the buffer distanced by at least the cache block size.

Some more specific suggestions for your particular problem: You might try reformatting the data into some binary format. For example, try changing the file encoding to a custom format instead of UTF-8 or whatever it is. That could reduce its size. 3.5 million lines is quite alot of characters to loop through... it's probably ~150 million character comparisons that you are doing. If you can sort the file by line length prior to the program running... you can write an algorithm to much more quickly parse the lines - just increment a pointer and test the character you arrive at, making sure it's '\n'. Then do whatever processing you need to do. You'll need to find a way to maintain the sorted file by inserting new data into appropriate places with this approach.

You can take this a step further - after sorting your file, maintain a list of how many lines of a given length are in the file. Use that to guide your parsing of lines - jump right to the end of each line w/out having to do character comparisons. If you can't sort the file, just create a list of all the offsets from the start of each line to its terminating newline. 3.5 million offsets. Write algorithms to update that list on insertion/deletion of lines from the file

When you get into file processing algorithms such as this... it begins to resemble the implementation of a noSQL database. An alternative might just be to insert all this data into a noSQL database. Depends on what you need to do: believe it or not, sometimes just raw custom file manipulation & maintenance described above is faster than any database implementation, even noSQL databases.

A few more things: When you use this streaming approach with read() you must take care to handle the edge cases - where you reach the end of one buffer, and start a new buffer - appropriately. That's called buffer-stitching.

Lastly, on most modern systems when you use read() the data still gets stored in the universal buffer cache and then copied into your process. That's an extra copy operation. You can disable the buffer cache to speed up the IO in certain cases where you're handling big files. Beware, this will disable paging. But if the data is only in memory for a brief time, this doesn't matter. The buffer cache is important - find a way to reenable it after the IO was finished. Maybe disable it just for the particular process, do your IO in a separate process, or something... I'm not sure about the details, but this is something that can be done. I don't think that's actually your problem, though, tbh I think the character comparisons - once you fix that it should just be fine.

That's the best I've got, maybe the experts will have other ideas. Carry onward!

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜