How to have a checkpoint file using mmap which is only synced to disk manually
I need the fastest way to periodically sync a file with memory.
What I think I would like is to have an mmap'd file, which is only sync'd to disk manually. I'm not sure how to prevent any automatic syncing from happening.
The file cannot be modified except at the times I manually specify. The point is to have a checkpoint file which keeps a snapshot of the state in memory. I would like to avoid copying as much as possible, since this will need to be called fairly frequently and speed is important.
Anything you write to the memory within a MAP_SHARED mapping of a file is considered as being written to the file at that time, as surely as if you had used write(). msync() in this sense is completely analogous to fsync() - it merely ensures that changes you have already made to the file are actually pushed out to permanent storage. You can't change this - it's how mmap() is defined to work.
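To see this for yourself, here is a minimal sketch (the function name `shared_store_is_visible` is made up for illustration). It stores into a MAP_SHARED mapping and then reads the file back with pread(), with no msync() in between. I'm assuming Linux or another system with a unified page cache, where the read sees the store right away.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Stores `msg` into a MAP_SHARED mapping of `path`, then reads the file back
 * with pread() on the same descriptor, without ever calling msync().
 * Returns 1 if the file already shows the new bytes, 0 if not, -1 on error. */
int shared_store_is_visible(const char *path, const char *msg)
{
    char buf[4096];
    size_t len = strlen(msg);
    if (len == 0 || len > sizeof buf)
        return -1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(map, msg, len);               /* a plain store, no msync() */

    ssize_t n = pread(fd, buf, len, 0);  /* read through the file interface */
    int visible = (n == (ssize_t)len && memcmp(buf, msg, len) == 0);

    munmap(map, len);
    close(fd);
    return visible;
}
```

Calling `shared_store_is_visible("/tmp/demo.bin", "checkpoint")` should return 1 on such a system. That is why msync() can't serve as a "commit" step: the file has logically changed before you call it.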
In general, the safe way to do this is to write a complete consistent copy of the data to a temporary file, sync the temporary file, then atomically rename it over the prior checkpoint file. This is the only way to ensure that a crash between checkpoints doesn't leave you with an inconsistent file. Any solution that does less copying is going to require both a more complicated transaction-log style file format, and be more intrusive to the rest of your application (requiring specific hooks to be invoked in each place that the in-memory state is changed).
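A minimal sketch of that write, sync, rename sequence, assuming a POSIX system. The function name `write_checkpoint` and the ".tmp" suffix are my own choices for illustration, not anything standard.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Writes `len` bytes from `data` to "<path>.tmp", syncs that file, then
 * renames it over `path`. Returns 0 on success, -1 on failure. */
int write_checkpoint(const char *path, const void *data, size_t len)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* write() may write fewer bytes than asked, so loop until done. */
    const char *p = data;
    size_t left = len;
    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w < 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        p += w;
        left -= (size_t)w;
    }

    /* Push the new copy to stable storage before it replaces the old one. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }

    /* Readers see either the old checkpoint or the new one, never a mix. */
    if (rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

A call like `write_checkpoint("state.ckpt", buf, buf_len)` replaces the checkpoint in one step. If you also need the rename itself to survive a power failure, you would probably want to fsync() the containing directory afterwards. The sketch leaves that out, and it does not retry on EINTR.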
You could mmap() the file as copy on write (MAP_PRIVATE) so that any updates you do in memory are not written back to the file, then when you want to sync, you could:
A) Make a new memory mapping that is not copy on write and copy just the pages you modified into it.
Or
B) Open the file (regular file open) with direct I/O (reads and writes whose sizes and offsets are aligned to the block size) and write only the pages you modified. Direct I/O would be nice and fast because you're writing whole pages (memory page size is a multiple of disk block size) and there's no buffering. This method has the benefit of not using address space in case your mmap() is large and there's no room to mmap() another huge file.
After the sync, your copy on write mmap() is the same as your disk file, but the kernel still has the pages you needed to sync marked as non shared (with the disk). So you can then close and recreate the mmap() (still copy on write). That way the kernel can discard your pages if there's memory pressure, instead of paging them out to swap space.
Of course, you'd have to keep track of which pages you had modified yourself because I can't think of how you'd get access to where the OS keeps that info. (wouldn't that be a handy syscall()?)
-- edit --
Actually, see "Can the dirtiness of pages of a mmap be found from userspace?" for ideas on how to see which pages are dirty.
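Here is a rough sketch of the copy on write idea with manual dirty tracking. Everything named `cow_*` is hypothetical. It assumes all updates go through `cow_write()` so the page flags stay accurate. To keep it short it uses ordinary buffered pwrite() plus fsync() rather than direct I/O, and it remaps in place with MAP_FIXED instead of closing and recreating the mapping.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct cow_map {
    int fd;
    unsigned char *base;   /* private (copy on write) view of the file */
    size_t len;            /* mapping length, rounded up to whole pages */
    size_t page;           /* system page size */
    unsigned char *dirty;  /* one flag per page, set by cow_write() */
};

int cow_open(struct cow_map *m, const char *path, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0 || len == 0)
        return -1;
    m->page = (size_t)ps;
    m->len = (len + m->page - 1) / m->page * m->page;
    m->fd = open(path, O_RDWR | O_CREAT, 0600);
    if (m->fd < 0)
        return -1;
    if (ftruncate(m->fd, (off_t)m->len) != 0)
        goto fail;
    m->dirty = calloc(m->len / m->page, 1);
    if (m->dirty == NULL)
        goto fail;
    m->base = mmap(NULL, m->len, PROT_READ | PROT_WRITE, MAP_PRIVATE, m->fd, 0);
    if (m->base == MAP_FAILED) {
        free(m->dirty);
        goto fail;
    }
    return 0;
fail:
    close(m->fd);
    return -1;
}

/* Every update goes through here so the touched pages get flagged. */
void cow_write(struct cow_map *m, size_t off, const void *src, size_t n)
{
    if (n == 0 || off > m->len || n > m->len - off)
        return;
    memcpy(m->base + off, src, n);
    for (size_t p = off / m->page; p <= (off + n - 1) / m->page; p++)
        m->dirty[p] = 1;
}

/* Writes only the flagged pages to the file, then remaps so the kernel can
 * drop the private copies and back those pages with the file again. */
int cow_checkpoint(struct cow_map *m)
{
    size_t npages = m->len / m->page;
    for (size_t p = 0; p < npages; p++) {
        if (!m->dirty[p])
            continue;
        off_t off = (off_t)(p * m->page);
        if (pwrite(m->fd, m->base + p * m->page, m->page, off)
                != (ssize_t)m->page)
            return -1;
        m->dirty[p] = 0;
    }
    if (fsync(m->fd) != 0)
        return -1;
    if (mmap(m->base, m->len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_FIXED, m->fd, 0) == MAP_FAILED)
        return -1;
    return 0;
}

void cow_close(struct cow_map *m)
{
    munmap(m->base, m->len);
    free(m->dirty);
    close(m->fd);
}
```

Typical use would be `cow_open()`, any number of `cow_write()` calls, then `cow_checkpoint()` whenever you want the file to catch up. Note this is not crash safe on its own: a crash in the middle of `cow_checkpoint()` can leave the file with a mix of old and new pages.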
mmap can't be used for this purpose. There's no way to prevent data from being written to disk. In practice, using mlock() to make the memory unswappable might have a side effect of preventing it from getting written to disk except when you ask for it to be written, but there's no guarantee. Certainly if another process opens the file, it's going to see the copy cached in memory (with your latest changes), not the copy on physical disk. In many ways, what you should do depends on whether you're trying to do synchronization with other processes or just for safety in case of crash or power failure.
If your data size is small, you might try a number of other methods for atomic syncing to disk. One way is to encode the entire dataset in a file name: create an empty file by that name, then delete the old file. If 2 files exist at startup (due to a crash at an extremely unlucky moment), delete the older one and resume from the newer one. write() may also be atomic if your data size is smaller than a filesystem block, page size, or disk block, but I don't know of any guarantee to that effect right off. You'd have to do some research.
Another very standard approach that works as long as your data isn't so big that 2 copies won't fit on disk: just create a second copy with a temporary name, then rename() it over top of the old one. rename() is always atomic. This is probably the best approach unless you have a reason not to do it that way.
As the other respondents have suggested, I don't think there's a portable way to do what you want without copying. If you're looking to do this in a special-purpose environment where you can control the OS etc, you may be able to do it under Linux with the btrfs filesystem.
btrfs supports a new reflink() operation which is essentially a copy-on-write filesystem copy. You could reflink() your file to a temporary on start-up, mmap() the temporary, then msync() and reflink() the temporary back to the original to checkpoint.
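I can't point to a libc function actually called reflink(). On current Linux the same operation is, as far as I know, exposed through the FICLONE ioctl (and `cp --reflink`). A small sketch under that assumption follows. The wrapper name `reflink_file` is made up, and the call is expected to fail on filesystems without reflink support.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>    /* FICLONE */
#include <sys/ioctl.h>
#include <unistd.h>

/* Makes `dst_path` a copy-on-write clone of `src_path` that shares its data
 * blocks. Returns 0 on success, or -1 with errno set, for example when the
 * filesystem does not support cloning. */
int reflink_file(const char *src_path, const char *dst_path)
{
    int src = open(src_path, O_RDONLY);
    if (src < 0)
        return -1;

    int dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (dst < 0) {
        int saved = errno;
        close(src);
        errno = saved;
        return -1;
    }

    int rc = ioctl(dst, FICLONE, src);
    int saved = errno;   /* keep the ioctl's errno across close() */
    close(src);
    close(dst);
    errno = saved;
    return rc == 0 ? 0 : -1;
}
```

With this, the scheme above would be `reflink_file("state.ckpt", "state.work")` at start-up, mmap() of the working file, and then msync() followed by `reflink_file("state.work", "state.ckpt")` at each checkpoint. The file names here are placeholders.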
I strongly suspect that no OS actually takes advantage of this, but it would be possible for an OS to optimize the following:
int fd = open("file", O_RDWR | O_SYNC | O_DIRECT);
size_t length = get_length(fd);  // get_length() is a placeholder helper
uint8_t *map_addr = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
...
// This represents all of the changes that could possibly happen before you
// want to update the on disk file.
change_various_data(map_addr);
if (is_time_to_update()) {
    write(fd, map_addr, length);
    // rewind so the next update starts at the beginning of the file again
    lseek(fd, 0, SEEK_SET);
    // you could have just used pwrite here and not seeked
}
The reason an OS could take advantage of this is that, until you write to a particular page (and no one else has either), the OS would probably just use the actual file's page at that location as the backing store for that page.
Then when you wrote to some set of those pages the OS would Copy On Write those pages for your process, but still keep the unwritten pages backed up by the original file.
Then, upon calling write, the OS could notice that the write was block aligned both in memory and on disk, and then it could notice that some of the source memory pages were already synced up with those exact file system pages that they were being written to and only write out the pages which had changed.
All of that being said, it wouldn't surprise me if this optimization isn't done by any OS, and this type of code ends up being really slow and causes lots of disk writing when you call 'write'. It would be cool if it was taken advantage of.