Random access of a large binary file
I have a large binary file (12 GB) from which I want to assemble a smaller binary file (16 KB) on the fly. Assume the file is on disk, and that the bytes for the smaller file are somewhat randomly distributed in the large binary file. What's the best and fastest way to do this? So far I've not been able to do better than about three minutes.
Things I've tried, which have more or less the same performance:
- Converting the file to the HDF5 format and using the C interface (slow).
- Writing a small C program to fseek() through the file (slow).
How can I randomly access this data really fast?
I want to get to less than a couple of seconds for the query.
The answer is basically "no".
A single mechanical disk drive is going to take 10 ms or so to perform a seek, because it has to move the disk head. 16000 seeks times 10 milliseconds per seek equals 160 seconds. It makes absolutely no difference how you write your code; e.g. mmap() is going to make no difference.
Welcome to the physical world, software person :-). You must improve the locality of your operations.
First, sort the locations you are accessing. Nearby locations in the file are likely to be nearby on disk, and seeking between nearby locations is faster than seeking randomly.
Next, your disk can probably read sequential data at around 100 megabytes/second; that is, it can read 1 megabyte sequentially in around the same time it takes to perform a seek. So if two of your values are less than 1 megabyte apart, you are better off reading all of the data between them than performing the seek between them. (But benchmark this to find the optimal trade-off on your hardware.)
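A rough sketch of both ideas in C using plain stdio; the extent bookkeeping and the 1 MB merge threshold are illustrative, not something the question specifies (compile with -D_FILE_OFFSET_BITS=64 on 32-bit systems so off_t can address a 12 GB file):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* One piece we need from the big file: where it lives, how long it is,
 * and where it goes in the small output buffer. (Illustrative structure;
 * the real offsets come from wherever your index of wanted bytes lives.) */
struct extent {
    off_t  offset;
    size_t length;
    char  *dest;
};

static int cmp_extent(const void *a, const void *b)
{
    off_t oa = ((const struct extent *)a)->offset;
    off_t ob = ((const struct extent *)b)->offset;
    return (oa > ob) - (oa < ob);
}

/* Read every extent in ascending offset order; if the next extent starts
 * less than merge_gap bytes past the current position, read through the
 * gap instead of seeking over it. */
static int read_sorted(FILE *f, struct extent *ext, size_t n)
{
    const off_t merge_gap = 1 << 20;   /* ~1 MB; benchmark to tune this */
    char gap_buf[64 * 1024];
    off_t pos = 0;

    qsort(ext, n, sizeof *ext, cmp_extent);
    if (fseeko(f, 0, SEEK_SET) != 0)
        return -1;

    for (size_t i = 0; i < n; i++) {
        off_t gap = ext[i].offset - pos;
        if (gap > 0 && gap <= merge_gap) {
            while (gap > 0) {          /* read and discard the gap bytes */
                size_t chunk = gap < (off_t)sizeof gap_buf
                             ? (size_t)gap : sizeof gap_buf;
                if (fread(gap_buf, 1, chunk, f) != chunk)
                    return -1;
                gap -= chunk;
            }
        } else if (gap != 0) {
            if (fseeko(f, ext[i].offset, SEEK_SET) != 0)
                return -1;
        }
        if (fread(ext[i].dest, 1, ext[i].length, f) != ext[i].length)
            return -1;
        pos = ext[i].offset + ext[i].length;
    }
    return 0;
}
```

The merge threshold is exactly the trade-off described above: read through a gap whenever doing so is cheaper than seeking over it.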
Finally, a RAID can help with throughput (but not seek time). It can also provide multiple disk heads that can seek concurrently if you want to multi-thread your read code.
But in general, accessing random data is about the worst thing you can ask your computer to do, whether in memory or on disk. And the relative difference between sequential access and random access increases every year because physics is local. (Well, the physics we depend on here, anyway.)
[edit]
@JeremyP's suggestion to use SSDs is a good one. If they are an option, they have an effective seek time of 0.1 ms or so. Meaning you could expect your code to run 50-100 times faster on such hardware. (I did not think of this because I usually work with files in the 1 TB range where SSDs would be too expensive.)
[edit 2]
As @FrankH mentions in a comment, some of my suggestions assume that the file is contiguous on disk, which of course is not guaranteed. You can help to improve this by using a good file system (e.g. XFS) and by giving "hints" at file creation time (e.g. use posix_fallocate to inform the kernel you intend to populate a large file).
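For example, something along these lines at file-creation time; a minimal sketch with a made-up filename and the 12 GB size taken from the question:

```c
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical output path; reserving the full 12 GB up front lets the
     * filesystem try to lay the file out contiguously. */
    int fd = open("bigfile.bin", O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return 1;

    /* posix_fallocate returns 0 on success, an errno value on failure. */
    if (posix_fallocate(fd, 0, 12LL * 1024 * 1024 * 1024) != 0) {
        close(fd);
        return 1;
    }

    /* ... write the 12 GB of data ... */
    close(fd);
    return 0;
}
```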
Well, the speed you can achieve for this largely depends on the total number of read operations you perform in order to extract the 16 kB which make up the payload for your new file.
Why is that so? Because random reads from (spinning) disks are seek-limited; the read as such is (almost) infinitely fast compared to the time it takes to re-position the magnetic heads.
Since you're saying the access pattern is random, you're also not likely to benefit from any readahead that the operating system may decide to use; if you so choose, you can therefore switch it off via posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) on the file descriptor for the big file (a length of 0 means the advice covers the whole file), or via madvise() if you've chosen to mmap() it. But that'll only gain you anything if you're performing big reads (and you know a big readahead would be nonsense). For small reads, it's exclusively the seek time that'll determine the total.
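A minimal sketch of the fadvise variant (the helper name is made up; posix_fadvise() is the portable spelling of the call):

```c
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>

/* open_big_file() is a hypothetical helper, not from the answer above. */
int open_big_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Tell the kernel the access pattern is random so it skips readahead;
     * a length of 0 means the advice applies up to the end of the file. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    return fd;
}
```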
Assuming you need N random reads and you've got an M msec seek time, it'll take at least N * M milliseconds to perform the data extraction (if you've got the disk to yourself ...). There is no way to break this barrier.
Edit: A few things on mitigating strategies:
As mentioned by several people, the key to approaching this problem is to minimize seeks. There are several strategies for this:
- Issue asynchronous reads if you can (i.e. if read operation N+1 doesn't depend on what read operation N did, then you can issue both concurrently). This allows the operating system / device driver to queue them up and possibly re-order them (or merge them with reads done by other concurrently running processes) for the most efficient seeking.
- If you know the positions all in advance, then perform scatter-gather I/O (the UN*X preadv() would come to mind), to the same effect; see the sketch after this list.
- Query your filesystem and/or block device for the best / minimum block size; how to do this is system-dependent, see e.g. statvfs() or even ioctl_list. If you know that, you might possibly use the technique mentioned by Nemo (merge two small reads within the "optimal" block size into a single large read, needing no seek).
- Possibly even use query interfaces like FIEMAP / FIBMAP (the Windows equivalent would roughly be FSCTL_GET_RETRIEVAL_POINTERS) to determine where the physical blocks for your file data are, and base the decision about read merging on that (there's no point issuing a large "non-seeking" read if it actually crosses a physical block boundary and the filesystem turns it into two).
- If you build up the positions to read from over a comparatively long time, then reading (asynchronously) as you still compute future read offsets will also help to hide seek latency, as you're putting compute cycles / wait time to good use.
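Here is the scatter-gather sketch referred to above. Note that preadv() reads one contiguous file range into several destination buffers, so the natural way to use it here is to merge two nearby pieces into a single request and throw the gap bytes away; the function name and buffer sizes below are made up for illustration:

```c
#define _GNU_SOURCE            /* preadv is an extension on Linux (glibc >= 2.10) */
#define _FILE_OFFSET_BITS 64
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical example: two wanted pieces lie close together in the file,
 * separated by a small gap. One preadv() call pulls the whole range and
 * scatters it into the two destination buffers plus a throwaway buffer for
 * the gap, so only one I/O request (and at most one seek) is issued. */
ssize_t read_two_pieces(int fd, off_t first_off,
                        char *dst1, size_t len1,
                        size_t gap_len,
                        char *dst2, size_t len2)
{
    char gap[64 * 1024];               /* sketch assumes gap_len <= sizeof gap */
    struct iovec iov[3] = {
        { .iov_base = dst1, .iov_len = len1    },
        { .iov_base = gap,  .iov_len = gap_len },
        { .iov_base = dst2, .iov_len = len2    },
    };
    return preadv(fd, iov, 3, first_off);
}
```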
In general, if none of the above applies, you'll have to bite the bullet and accept the seek latency. Buy a solid state disk and/or use a RAM-backed filesystem if you can justify the costs (and/or the volatility of RAM).
Have you tried mmap()ing the file (in your case, mmap64)? This will lazily read the data from disk as you access it.
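A sketch of what that might look like (the offsets/lengths arrays and helper name are hypothetical; with _FILE_OFFSET_BITS=64, mmap resolves to the 64-bit variant, and a 64-bit address space is needed to map 12 GB):

```c
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the whole file read-only and copy the wanted bytes out of it. */
int extract_via_mmap(const char *path, const off_t *offsets,
                     const size_t *lengths, size_t n, char *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                               /* the mapping stays valid */
    if (map == MAP_FAILED)
        return -1;

    madvise(map, (size_t)st.st_size, MADV_RANDOM);   /* no sequential readahead */

    for (size_t i = 0; i < n; i++) {         /* each access may fault -> one seek */
        memcpy(out, map + offsets[i], lengths[i]);
        out += lengths[i];
    }

    munmap(map, (size_t)st.st_size);
    return 0;
}
```

Keep in mind that each first touch of an unmapped page still costs a disk seek, so this changes the programming model, not the physics.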
If you're having to seek through the entire file to find the data you're looking for, you'll be able to speed it up with an SSD, but it's always going to be slow. Are the locations of the data you're looking for known ahead of time?
Is the file a text file, or a binary file?
If you have to read the whole file and you are using a mechanical hard disk, you are screwed. Assume the transfer rate is about 1 gigabit/second; that means you physically can't get all the bits across the bus in less than 12 x 8 = 96 seconds. That assumes there is no seek time and the processor can deal with the data as it comes in.
Since the transfer rate is limited by the speed of the drive as much as anything, even if you know exactly where every byte you want to read is, reading bytes spread randomly across the file will still take about as long, because you have to wait for the disk to rotate until the next byte you want is under the head.
If you have an SSD you can probably improve on this dramatically, since there's no waiting for the bytes to come round under the head...
Some hints to speed up reading files a little (besides what was already said):
- read chunks whose size is a multiple of the block size (see the sketch below for one way to query it);
- on POSIX-compliant systems, use posix_fadvise() to advise the OS about your access pattern.
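For the first hint, one way to find the block size to round up to is statvfs() (the path below is hypothetical):

```c
#include <stdio.h>
#include <sys/statvfs.h>

/* Query the filesystem's preferred I/O block size so reads can be sized
 * and aligned to a multiple of it. */
int main(void)
{
    struct statvfs vfs;
    if (statvfs("/data/bigfile.bin", &vfs) != 0)   /* hypothetical path */
        return 1;

    printf("preferred I/O block size: %lu bytes\n", (unsigned long)vfs.f_bsize);
    return 0;
}
```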
I suppose it depends on how many seeks you need to do. 16 thousand, or a smaller number? Can you store the 12 GB file on a solid state drive? That would cut down on the seek latencies.
Can you break up the file and store the pieces on separate hard drives? That would enable asynchronous seeking in parallel.
Use parallel or asynchronous reads. Issue them from multiple threads, processes, etc. as necessary, or use preadv, just like FrankH said.
This means that you won't have to wait for one I/O request to complete before the next one can come along, which is going to improve performance if you have a clever RAID controller and lots of spindles.
On the other hand, if you have a really stupid I/O subsystem, it may make only a minor difference. Consider which I/O scheduler to use (you can change them on the fly, without a reboot, which is really cool). Anecdotal evidence suggests "noop" is best if you have "smart" hardware, and cfq or deadline if you have stupid hardware.
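A sketch of the multi-threaded variant using pread(), which is safe to call from several threads on one shared descriptor because it doesn't move the file offset; the job partitioning is made up (compile with -pthread):

```c
#define _FILE_OFFSET_BITS 64
#include <pthread.h>
#include <unistd.h>

/* Hypothetical per-thread work item: each worker issues its own pread()
 * calls, so several requests are in flight at once and the I/O scheduler
 * or RAID controller can reorder them. */
struct job {
    int     fd;
    off_t  *offsets;
    size_t *lengths;
    size_t  count;
    char   *out;       /* private output area for this worker */
};

static void *worker(void *arg)
{
    struct job *j = arg;
    char *dst = j->out;
    for (size_t i = 0; i < j->count; i++) {
        if (pread(j->fd, dst, j->lengths[i], j->offsets[i]) != (ssize_t)j->lengths[i])
            return (void *)1;              /* non-NULL signals failure */
        dst += j->lengths[i];
    }
    return NULL;
}

/* Launch nthreads workers over pre-partitioned job descriptions. */
int read_in_parallel(struct job *jobs, int nthreads)
{
    pthread_t tid[64];                     /* sketch assumes nthreads <= 64 */
    void *ret;
    int err = 0;

    for (int t = 0; t < nthreads; t++)
        pthread_create(&tid[t], NULL, worker, &jobs[t]);
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tid[t], &ret);
        if (ret != NULL)
            err = -1;
    }
    return err;
}
```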