Most efficient way to copy a file in Linux
I am working on an OS-independent file manager, and I am looking for the most efficient way to copy a file on Linux. Windows has a built-in function, CopyFileEx(), but as far as I can tell there is no equivalent standard function for Linux, so I guess I will have to implement my own. The obvious way is fopen/fread/fwrite, but is there a better (faster) way of doing it? I must also have the ability to stop every once in a while so that I can update the "copied so far" count for the file progress menu.
Unfortunately, you cannot use sendfile() here because the destination is not a socket. (The name sendfile() comes from send() + "file".)
For zero-copy, you can use splice() as suggested by @Dave. (Except it will not be zero-copy; it will be "one copy" from the source file's page cache to the destination file's page cache.)

However... (a) splice() is Linux-specific; and (b) you can almost certainly do just as well using portable interfaces, provided you use them correctly.
In short, use open() + read() + write() with a small temporary buffer. I suggest 8K. So your code would look something like this:
int in_fd = open("source", O_RDONLY);
assert(in_fd >= 0);

/* O_CREAT (with a mode) is needed in case "dest" does not exist yet. */
int out_fd = open("dest", O_WRONLY | O_CREAT | O_TRUNC, 0644);
assert(out_fd >= 0);

char buf[8192];

while (1) {
    ssize_t read_result = read(in_fd, &buf[0], sizeof(buf));
    if (!read_result) break;
    assert(read_result > 0);
    ssize_t write_result = write(out_fd, &buf[0], read_result);
    assert(write_result == read_result);
}
With this loop, you will be copying 8K from the in_fd page cache into the CPU L1 cache, then writing it from the L1 cache into the out_fd page cache. Then you will overwrite that part of the L1 cache with the next 8K chunk from the file, and so on. The net result is that the data in buf will never actually be stored in main memory at all (except maybe once at the end); from the system RAM's point of view, this is just as good as using "zero-copy" splice(). Plus it is perfectly portable to any POSIX system.
Note that the small buffer is key here. Typical modern CPUs have 32K or so for the L1 data cache, so if you make the buffer too big, this approach will be slower. Possibly much, much slower. So keep the buffer in the "few kilobytes" range.
Of course, unless your disk subsystem is very, very fast, memory bandwidth is probably not your limiting factor. So I would recommend posix_fadvise to let the kernel know what you are up to:
posix_fadvise(in_fd, 0, 0, POSIX_FADV_SEQUENTIAL);
This will give a hint to the Linux kernel that its read-ahead machinery should be very aggressive.
I would also suggest using posix_fallocate to preallocate the storage for the destination file. This will tell you ahead of time whether you will run out of disk space. And for a modern kernel with a modern file system (like XFS), it will help to reduce fragmentation in the destination file.
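A minimal sketch, reusing the in_fd/out_fd descriptors from the loop above (this needs <sys/stat.h>; note that posix_fallocate returns an errno value directly rather than setting errno):

struct stat st;
assert(fstat(in_fd, &st) == 0);   /* size of the source file */

/* Reserve the full destination size up front. */
int err = posix_fallocate(out_fd, 0, st.st_size);
assert(err == 0);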
The last thing I would recommend is mmap. It is usually the slowest approach of all thanks to TLB thrashing. (Very recent kernels with "transparent hugepages" might mitigate this; I have not tried recently. But it certainly used to be very bad. So I would only bother testing mmap if you have lots of time to benchmark and a very recent kernel.)
[Update]

There is some question in the comments about whether splice from one file to another is zero-copy. The Linux kernel developers call this "page stealing". Both the man page for splice and the comments in the kernel source say that the SPLICE_F_MOVE flag should provide this functionality.

Unfortunately, the support for SPLICE_F_MOVE was yanked in 2.6.21 (back in 2007) and never replaced. (The comments in the kernel sources never got updated.) If you search the kernel sources, you will find SPLICE_F_MOVE is not actually referenced anywhere. The last message I can find (from 2008) says it is "waiting for a replacement".
The bottom line is that splice from one file to another calls memcpy to move the data; it is not zero-copy. This is not much better than you can do in userspace using read/write with small buffers, so you might as well stick to the standard, portable interfaces.

If "page stealing" is ever added back into the Linux kernel, then the benefits of splice would be much greater. (And even today, when the destination is a socket, you get true zero-copy, making splice more attractive.) But for the purposes of this question, splice does not buy you very much.
If you know they'll be using Linux > 2.6.17, splice() is the way to do zero-copy in Linux:
// Using some default parameters for clarity below. Don't do this in production.
#define splice(a, b, c) splice(a, 0, b, 0, c, 0)

int p[2];
pipe(p);                                  // splice() needs a pipe as an intermediary

int out = open(OUTFILE, O_WRONLY | O_CREAT | O_TRUNC, 0644);
int in = open(INFILE, O_RDONLY);

// Move up to 4K per iteration: file -> pipe, then pipe -> file.
while (splice(p[0], out, splice(in, p[1], 4096)) > 0);
Use open/read/write; they avoid the libc-level buffering done by fopen and friends.
Alternatively, if you are using GLib, you could use its g_file_copy function (from GIO).
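A sketch of that approach; conveniently, g_file_copy takes a progress callback, which fits the "copied so far" requirement from the question (the function name copy_with_glib is illustrative):

#include <gio/gio.h>
#include <stdio.h>

/* Called periodically during the copy with running byte counts. */
static void on_progress(goffset current, goffset total, gpointer user_data)
{
    printf("copied %lld of %lld bytes\n", (long long)current, (long long)total);
}

int copy_with_glib(const char *src, const char *dst)
{
    GFile *from = g_file_new_for_path(src);
    GFile *to = g_file_new_for_path(dst);
    GError *error = NULL;

    gboolean ok = g_file_copy(from, to, G_FILE_COPY_OVERWRITE,
                              NULL,              /* cancellable */
                              on_progress, NULL, /* progress callback + data */
                              &error);
    if (!ok) {
        fprintf(stderr, "copy failed: %s\n", error->message);
        g_error_free(error);
    }

    g_object_unref(from);
    g_object_unref(to);
    return ok ? 0 : -1;
}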
Finally, what may be faster, but should be tested to be sure: use open and mmap to memory-map the input file, then write from the memory region to the output file (a sketch follows below). You'll probably want to keep open/read/write around as a fallback, as this method is limited by the address space of your process.
Edit: the original answer suggested mapping both files; @bdonlan made an excellent suggestion in a comment to map only one.
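A minimal sketch of the single-mapping variant, with error handling reduced to asserts (it assumes a non-empty source file, since mmap rejects zero-length mappings):

#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int copy_via_mmap(const char *src, const char *dst)
{
    int in_fd = open(src, O_RDONLY);
    assert(in_fd >= 0);

    struct stat st;
    assert(fstat(in_fd, &st) == 0);

    /* Map the source read-only; no read() copies into a userspace buffer. */
    void *mapped = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, in_fd, 0);
    assert(mapped != MAP_FAILED);

    int out_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    assert(out_fd >= 0);

    /* One big write; production code should loop to handle short writes. */
    ssize_t written = write(out_fd, mapped, st.st_size);
    assert(written == (ssize_t)st.st_size);

    munmap(mapped, st.st_size);
    close(in_fd);
    close(out_fd);
    return 0;
}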
My answer from a more recent duplicate of this post:

Boost now offers mapped_file_source, which portably models a memory-mapped file. Maybe not as efficient as CopyFileEx() or splice(), but it is portable and succinct.
This program takes 2 filename arguments. It copies the first half of the source file to the destination file.
#include <boost/iostreams/device/mapped_file.hpp>
#include <iostream>
#include <fstream>
#include <cstdio>
namespace iostreams = boost::iostreams;
int main(int argc, char** argv)
{
    if (argc != 3)
    {
        std::cerr << "usage: " << argv[0] << " <infile> <outfile> - copies half of the infile to outfile" << std::endl;
        std::exit(100);
    }

    auto source = iostreams::mapped_file_source(argv[1]);
    auto dest = std::ofstream(argv[2], std::ios::binary);
    dest.exceptions(std::ios::failbit | std::ios::badbit);

    auto first = source.begin();
    auto bytes = source.size() / 2;

    dest.write(first, bytes);
}
Depending on the OS, your mileage may vary with system calls such as splice and sendfile; however, note this caveat from the sendfile man page:

Applications may wish to fall back to read(2)/write(2) in the case where sendfile() fails with EINVAL or ENOSYS.
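A sketch of that fallback pattern (sendfile to a regular-file destination requires Linux 2.6.33 or later; the function name copy_chunk and the buffer size are illustrative):

#include <errno.h>
#include <sys/sendfile.h>
#include <unistd.h>

/* Copy up to count bytes; returns bytes copied, 0 at EOF, -1 on error. */
static ssize_t copy_chunk(int in_fd, int out_fd, size_t count)
{
    ssize_t n = sendfile(out_fd, in_fd, NULL, count);
    if (n >= 0)
        return n;

    if (errno == EINVAL || errno == ENOSYS) {
        /* Fall back to plain read/write (short-write handling omitted). */
        char buf[8192];
        ssize_t r = read(in_fd, buf, count < sizeof(buf) ? count : sizeof(buf));
        if (r <= 0)
            return r;
        return write(out_fd, buf, r);
    }
    return -1;
}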
I wrote some benchmarks to test this out and found copy_file_range to be the fastest. Otherwise, either use a 128 KiB buffer, or use a read-only mmap for the src data and the write syscall for the dest data.
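For reference, a minimal sketch of calling copy_file_range directly (the syscall is available since Linux 4.5, and glibc >= 2.27 provides the wrapper used here); the loop also gives you a natural point to report progress:

#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* The copy happens entirely inside the kernel; no data crosses into userspace. */
int copy_with_cfr(const char *src, const char *dst)
{
    int in_fd = open(src, O_RDONLY);
    assert(in_fd >= 0);

    struct stat st;
    assert(fstat(in_fd, &st) == 0);

    int out_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    assert(out_fd >= 0);

    off_t remaining = st.st_size;
    while (remaining > 0) {
        ssize_t n = copy_file_range(in_fd, NULL, out_fd, NULL, remaining, 0);
        assert(n > 0);
        remaining -= n; /* update a "copied so far" counter here */
    }

    close(in_fd);
    close(out_fd);
    return 0;
}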
Article: https://alexsaveau.dev/blog/performance/files/kernel/the-fastest-way-to-copy-a-file
Benchmarks: https://github.com/SUPERCILEX/fuc/blob/fb0ec728dbd323f351d05e1d338b8f669e0d5b5d/cpz/benches/copy_methods.rs
Benchmarks inlined in case that link goes down:
use std::{
    alloc,
    alloc::Layout,
    fs::{copy, File, OpenOptions},
    io::{BufRead, BufReader, Read, Write},
    os::unix::{fs::FileExt, io::AsRawFd},
    path::{Path, PathBuf},
    thread,
    time::Duration,
};
use cache_size::l1_cache_size;
use criterion::{
    criterion_group, criterion_main, measurement::WallTime, BatchSize, BenchmarkGroup, BenchmarkId,
    Criterion, Throughput,
};
use memmap2::{Mmap, MmapOptions};
use rand::{thread_rng, RngCore};
use tempfile::{tempdir, TempDir};
// Don't use an OS backed tempfile since it might change the performance characteristics of our copy
struct NormalTempFile {
    dir: TempDir,
    from: PathBuf,
    to: PathBuf,
}
impl NormalTempFile {
    fn create(bytes: usize, direct_io: bool) -> NormalTempFile {
        if direct_io && bytes % (1 << 12) != 0 {
            panic!("Num bytes ({}) must be divisible by 2^12", bytes);
        }

        let dir = tempdir().unwrap();
        let from = dir.path().join("from");

        let buf = create_random_buffer(bytes, direct_io);
        open_standard(&from, direct_io).write_all(&buf).unwrap();

        NormalTempFile {
            to: dir.path().join("to"),
            dir,
            from,
        }
    }
}
/// Doesn't use direct I/O, so files will be mem cached
fn with_memcache(c: &mut Criterion) {
    let mut group = c.benchmark_group("with_memcache");
    for num_bytes in [1 << 10, 1 << 20, 1 << 25] {
        add_benches(&mut group, num_bytes, false);
    }
}

/// Use direct I/O to create the file to be copied so it's not cached initially
fn initially_uncached(c: &mut Criterion) {
    let mut group = c.benchmark_group("initially_uncached");
    for num_bytes in [1 << 20] {
        add_benches(&mut group, num_bytes, true);
    }
}
fn empty_files(c: &mut Criterion) {
    let mut group = c.benchmark_group("empty_files");
    group.throughput(Throughput::Elements(1));

    group.bench_function("copy_file_range", |b| {
        b.iter_batched(
            || NormalTempFile::create(0, false),
            |files| {
                // Uses the copy_file_range syscall on Linux
                copy(files.from, files.to).unwrap();
                files.dir
            },
            BatchSize::LargeInput,
        )
    });

    group.bench_function("open", |b| {
        b.iter_batched(
            || NormalTempFile::create(0, false),
            |files| {
                File::create(files.to).unwrap();
                files.dir
            },
            BatchSize::LargeInput,
        )
    });

    #[cfg(target_os = "linux")]
    group.bench_function("mknod", |b| {
        b.iter_batched(
            || NormalTempFile::create(0, false),
            |files| {
                use nix::sys::stat::{mknod, Mode, SFlag};
                mknod(files.to.as_path(), SFlag::S_IFREG, Mode::empty(), 0).unwrap();
                files.dir
            },
            BatchSize::LargeInput,
        )
    });
}
fn just_writes(c: &mut Criterion) {
    let mut group = c.benchmark_group("just_writes");
    for num_bytes in [1 << 20] {
        group.throughput(Throughput::Bytes(num_bytes));
        group.bench_with_input(
            BenchmarkId::new("open_memcache", num_bytes),
            &num_bytes,
            |b, num_bytes| {
                b.iter_batched(
                    || {
                        let dir = tempdir().unwrap();
                        let buf = create_random_buffer(*num_bytes as usize, false);
                        (dir, buf)
                    },
                    |(dir, buf)| {
                        File::create(dir.path().join("file"))
                            .unwrap()
                            .write_all(&buf)
                            .unwrap();
                        (dir, buf)
                    },
                    BatchSize::PerIteration,
                )
            },
        );

        group.bench_with_input(
            BenchmarkId::new("open_nocache", num_bytes),
            &num_bytes,
            |b, num_bytes| {
                b.iter_batched(
                    || {
                        let dir = tempdir().unwrap();
                        let buf = create_random_buffer(*num_bytes as usize, true);
                        (dir, buf)
                    },
                    |(dir, buf)| {
                        let mut out = open_standard(dir.path().join("file").as_ref(), true);
                        out.set_len(*num_bytes).unwrap();
                        out.write_all(&buf).unwrap();
                        (dir, buf)
                    },
                    BatchSize::PerIteration,
                )
            },
        );
    }
}
fn add_benches(group: &mut BenchmarkGroup<WallTime>, num_bytes: u64, direct_io: bool) {
    group.throughput(Throughput::Bytes(num_bytes));

    group.bench_with_input(
        BenchmarkId::new("copy_file_range", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    // Uses the copy_file_range syscall on Linux
                    copy(files.from, files.to).unwrap();
                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    group.bench_with_input(
        BenchmarkId::new("buffered", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let reader = BufReader::new(File::open(files.from).unwrap());
                    write_from_buffer(files.to, reader);
                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    group.bench_with_input(
        BenchmarkId::new("buffered_l1_tuned", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let l1_cache_size = l1_cache_size().unwrap();
                    let reader =
                        BufReader::with_capacity(l1_cache_size, File::open(files.from).unwrap());
                    write_from_buffer(files.to, reader);
                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    group.bench_with_input(
        BenchmarkId::new("buffered_readahead_tuned", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let readahead_size = 1 << 17; // See https://eklitzke.org/efficient-file-copying-on-linux
                    let reader =
                        BufReader::with_capacity(readahead_size, File::open(files.from).unwrap());
                    write_from_buffer(files.to, reader);
                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    group.bench_with_input(
        BenchmarkId::new("buffered_parallel", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let threads = num_cpus::get() as u64;
                    let chunk_size = num_bytes / threads;

                    let from = File::open(files.from).unwrap();
                    let to = File::create(files.to).unwrap();
                    advise(&from);
                    to.set_len(*num_bytes).unwrap();

                    let mut results = Vec::with_capacity(threads as usize);
                    for i in 0..threads {
                        let from = from.try_clone().unwrap();
                        let to = to.try_clone().unwrap();
                        results.push(thread::spawn(move || {
                            let mut buf = Vec::with_capacity(chunk_size as usize);
                            // We write those bytes immediately after and dropping u8s does nothing
                            #[allow(clippy::uninit_vec)]
                            unsafe {
                                buf.set_len(chunk_size as usize);
                            }

                            from.read_exact_at(&mut buf, i * chunk_size).unwrap();
                            to.write_all_at(&buf, i * chunk_size).unwrap();
                        }));
                    }
                    for handle in results {
                        handle.join().unwrap();
                    }

                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    group.bench_with_input(
        BenchmarkId::new("buffered_entire_file", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let mut from = File::open(files.from).unwrap();
                    let mut to = File::create(files.to).unwrap();
                    advise(&from);
                    to.set_len(*num_bytes).unwrap();

                    let mut buf = Vec::with_capacity(*num_bytes as usize);
                    from.read_to_end(&mut buf).unwrap();
                    to.write_all(&buf).unwrap();

                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    group.bench_with_input(
        BenchmarkId::new("mmap_read_only", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let from = File::open(files.from).unwrap();
                    let reader = unsafe { Mmap::map(&from) }.unwrap();
                    let mut to = File::create(files.to).unwrap();
                    advise(&from);

                    to.write_all(reader.as_ref()).unwrap();

                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    group.bench_with_input(
        BenchmarkId::new("mmap_read_only_truncate", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let from = File::open(files.from).unwrap();
                    let reader = unsafe { Mmap::map(&from) }.unwrap();
                    let mut to = File::create(files.to).unwrap();
                    advise(&from);
                    to.set_len(*num_bytes).unwrap();

                    to.write_all(reader.as_ref()).unwrap();

                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    #[cfg(target_os = "linux")]
    group.bench_with_input(
        BenchmarkId::new("mmap_read_only_fallocate", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let from = File::open(files.from).unwrap();
                    let reader = unsafe { Mmap::map(&from) }.unwrap();
                    let mut to = File::create(files.to).unwrap();
                    advise(&from);
                    allocate(&to, *num_bytes);

                    to.write_all(reader.as_ref()).unwrap();

                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    group.bench_with_input(
        BenchmarkId::new("mmap_rw_truncate", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let from = File::open(files.from).unwrap();
                    let to = OpenOptions::new()
                        .read(true)
                        .write(true)
                        .create(true)
                        .open(files.to)
                        .unwrap();
                    to.set_len(*num_bytes).unwrap();
                    advise(&from);

                    let reader = unsafe { Mmap::map(&from) }.unwrap();
                    let mut writer = unsafe { MmapOptions::new().map_mut(&to) }.unwrap();
                    writer.copy_from_slice(reader.as_ref());

                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );
}
fn open_standard(path: &Path, direct_io: bool) -> File {
    let mut options = OpenOptions::new();
    options.write(true).create(true).truncate(true);

    #[cfg(target_os = "linux")]
    if direct_io {
        use nix::libc::O_DIRECT;
        use std::os::unix::fs::OpenOptionsExt;

        options.custom_flags(O_DIRECT);
    }

    let file = options.open(path).unwrap();

    #[cfg(target_os = "macos")]
    if direct_io {
        use nix::{
            errno::Errno,
            libc::{fcntl, F_NOCACHE},
        };

        Errno::result(unsafe { fcntl(file.as_raw_fd(), F_NOCACHE) }).unwrap();
    }

    file
}
fn write_from_buffer(to: PathBuf, mut reader: BufReader<File>) {
    advise(reader.get_ref());

    let mut to = File::create(to).unwrap();
    to.set_len(reader.get_ref().metadata().unwrap().len())
        .unwrap();

    loop {
        let len = {
            let buf = reader.fill_buf().unwrap();
            if buf.is_empty() {
                break;
            }

            to.write_all(buf).unwrap();
            buf.len()
        };
        reader.consume(len)
    }
}
#[cfg(target_os = "linux")]
fn allocate(file: &File, len: u64) {
    use nix::{
        fcntl::{fallocate, FallocateFlags},
        libc::off_t,
    };

    fallocate(file.as_raw_fd(), FallocateFlags::empty(), 0, len as off_t).unwrap();
}

fn advise(_file: &File) {
    // Interestingly enough, this either had no effect on performance or made it slightly worse.
    // posix_fadvise(file.as_raw_fd(), 0, 0, POSIX_FADV_SEQUENTIAL).unwrap();
}
fn create_random_buffer(bytes: usize, direct_io: bool) -> Vec<u8> {
    let mut buf = if direct_io {
        let layout = Layout::from_size_align(bytes, 1 << 12).unwrap();
        let ptr = unsafe { alloc::alloc(layout) };
        unsafe { Vec::<u8>::from_raw_parts(ptr, bytes, bytes) }
    } else {
        let mut v = Vec::with_capacity(bytes);
        // We write those bytes immediately after and dropping u8s does nothing
        #[allow(clippy::uninit_vec)]
        unsafe {
            v.set_len(bytes);
        }
        v
    };

    thread_rng().fill_bytes(buf.as_mut_slice());
    buf
}
criterion_group! {
    name = benches;
    config = Criterion::default().noise_threshold(0.02).warm_up_time(Duration::from_secs(1));
    targets =
        with_memcache,
        initially_uncached,
        empty_files,
        just_writes,
}

criterion_main!(benches);
You may want to benchmark the dd command.