Fastest way to write an array of integers to a file in Java?
As the title says, I'm looking for the fastest possible way to write integer arrays to files. The arrays will vary in size, and will realistically contain anywhere between 2500 and 25 000 000 ints.
Here's the code I'm presently using:
DataOutputStream writer = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(filename)));
for (int d : data)
writer.writeInt(d);
Given that DataO开发者_StackOverflow社区utputStream has a method for writing arrays of bytes, I've tried converting the int array to a byte array like this:
private static byte[] integersToBytes(int[] values) throws IOException {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
for (int i = 0; i < values.length; ++i) {
dos.writeInt(values[i]);
}
return baos.toByteArray();
}
and like this:
private static byte[] integersToBytes2(int[] src) {
int srcLength = src.length;
byte[] dst = new byte[srcLength << 2];
for (int i = 0; i < srcLength; i++) {
int x = src[i];
int j = i << 2;
dst[j++] = (byte) ((x >>> 0) & 0xff);
dst[j++] = (byte) ((x >>> 8) & 0xff);
dst[j++] = (byte) ((x >>> 16) & 0xff);
dst[j++] = (byte) ((x >>> 24) & 0xff);
}
return dst;
}
Both seem to give a minor speed increase, about 5%. I've not tested them rigorously enough to confirm that.
Are there any techniques that will speed up this file write operation, or relevant guides to best practice for Java IO write performance?
I had a look at three options:
- Using
DataOutputStream
; - Using
ObjectOutputStream
(forSerializable
objects, whichint[]
is); and - Using
FileChannel
.
The results are
DataOutputStream wrote 1,000,000 ints in 3,159.716 ms
ObjectOutputStream wrote 1,000,000 ints in 295.602 ms
FileChannel wrote 1,000,000 ints in 110.094 ms
So the NIO version is the fastest. It also has the advantage of allowing edits, meaning you can easily change one int whereas the ObjectOutputStream
would require reading the entire array, modifying it and writing it out to file.
Code follows:
private static final int NUM_INTS = 1000000;
interface IntWriter {
void write(int[] ints);
}
public static void main(String[] args) {
int[] ints = new int[NUM_INTS];
Random r = new Random();
for (int i=0; i<NUM_INTS; i++) {
ints[i] = r.nextInt();
}
time("DataOutputStream", new IntWriter() {
public void write(int[] ints) {
storeDO(ints);
}
}, ints);
time("ObjectOutputStream", new IntWriter() {
public void write(int[] ints) {
storeOO(ints);
}
}, ints);
time("FileChannel", new IntWriter() {
public void write(int[] ints) {
storeFC(ints);
}
}, ints);
}
private static void time(String name, IntWriter writer, int[] ints) {
long start = System.nanoTime();
writer.write(ints);
long end = System.nanoTime();
double ms = (end - start) / 1000000d;
System.out.printf("%s wrote %,d ints in %,.3f ms%n", name, ints.length, ms);
}
private static void storeOO(int[] ints) {
ObjectOutputStream out = null;
try {
out = new ObjectOutputStream(new FileOutputStream("object.out"));
out.writeObject(ints);
} catch (IOException e) {
throw new RuntimeException(e);
} finally {
safeClose(out);
}
}
private static void storeDO(int[] ints) {
DataOutputStream out = null;
try {
out = new DataOutputStream(new FileOutputStream("data.out"));
for (int anInt : ints) {
out.write(anInt);
}
} catch (IOException e) {
throw new RuntimeException(e);
} finally {
safeClose(out);
}
}
private static void storeFC(int[] ints) {
FileOutputStream out = null;
try {
out = new FileOutputStream("fc.out");
FileChannel file = out.getChannel();
ByteBuffer buf = file.map(FileChannel.MapMode.READ_WRITE, 0, 4 * ints.length);
for (int i : ints) {
buf.putInt(i);
}
file.close();
} catch (IOException e) {
throw new RuntimeException(e);
} finally {
safeClose(out);
}
}
private static void safeClose(OutputStream out) {
try {
if (out != null) {
out.close();
}
} catch (IOException e) {
// do nothing
}
}
I would use FileChannel
from the nio package and ByteBuffer
. This approach seems (on my computer) give 2 to 4 times better write performance:
Output from program:
normal time: 2555
faster time: 765
This is the program:
public class Test {
public static void main(String[] args) throws IOException {
// create a test buffer
ByteBuffer buffer = createBuffer();
long start = System.currentTimeMillis();
{
// do the first test (the normal way of writing files)
normalToFile(new File("first"), buffer.asIntBuffer());
}
long middle = System.currentTimeMillis();
{
// use the faster nio stuff
fasterToFile(new File("second"), buffer);
}
long done = System.currentTimeMillis();
// print the result
System.out.println("normal time: " + (middle - start));
System.out.println("faster time: " + (done - middle));
}
private static void fasterToFile(File file, ByteBuffer buffer)
throws IOException {
FileChannel fc = null;
try {
fc = new FileOutputStream(file).getChannel();
fc.write(buffer);
} finally {
if (fc != null)
fc.close();
buffer.rewind();
}
}
private static void normalToFile(File file, IntBuffer buffer)
throws IOException {
DataOutputStream writer = null;
try {
writer =
new DataOutputStream(new BufferedOutputStream(
new FileOutputStream(file)));
while (buffer.hasRemaining())
writer.writeInt(buffer.get());
} finally {
if (writer != null)
writer.close();
buffer.rewind();
}
}
private static ByteBuffer createBuffer() {
ByteBuffer buffer = ByteBuffer.allocate(4 * 25000000);
Random r = new Random(1);
while (buffer.hasRemaining())
buffer.putInt(r.nextInt());
buffer.rewind();
return buffer;
}
}
Benchmarks should be repeated every once in a while, shouldn't they? :) After fixing some bugs and adding my own writing variant, here are the results I get when running the benchmark on an ASUS ZenBook UX305 running Windows 10 (times given in seconds):
Running tests... 0 1 2
Buffered DataOutputStream 8,14 8,46 8,30
FileChannel alt2 1,55 1,18 1,12
ObjectOutputStream 9,60 10,41 11,68
FileChannel 1,49 1,20 1,21
FileChannel alt 5,49 4,58 4,66
And here are the results running on the same computer but with Arch Linux and the order of the write methods switched:
Running tests... 0 1 2
Buffered DataOutputStream 31,16 6,29 7,26
FileChannel 1,07 0,83 0,82
FileChannel alt2 1,25 1,71 1,42
ObjectOutputStream 3,47 5,39 4,40
FileChannel alt 2,70 3,27 3,46
Each test wrote an 800mb file. The unbuffered DataOutputStream took way to long so I excluded it from the benchmark.
As seen, writing using a file channel still beats the crap out of all other methods, but it matters a lot whether the byte buffer is memory-mapped or not. Without memory-mapping the file channel write took 3-5 seconds:
var bb = ByteBuffer.allocate(4 * ints.length);
for (int i : ints)
bb.putInt(i);
bb.flip();
try (var fc = new FileOutputStream("fcalt.out").getChannel()) {
fc.write(bb);
}
With memory-mapping, the time was reduced to between 0.8 to 1.5 seconds:
try (var fc = new RandomAccessFile("fcalt2.out", "rw").getChannel()) {
var bb = fc.map(READ_WRITE, 0, 4 * ints.length);
bb.asIntBuffer().put(ints);
}
But note that the results are order-dependent. Especially so on Linux. It appears that the memory-mapped methods doesn't write the data in full but rather offloads the job request to the OS and returns before it is completed. Whether that behaviour is desirable or not depends on the situation.
Memory-mapping can also lead to OutOfMemory problems so it is not always the right tool to use. Prevent OutOfMemory when using java.nio.MappedByteBuffer.
Here is my version of the benchmark code: https://gist.github.com/bjourne/53b7eabc6edea27ffb042e7816b7830b
I think you should consider using file channels (the java.nio library) instead of plain streams (java.io). A good starting point is this interesting discussion: Java NIO FileChannel versus FileOutputstream performance / usefulness
and the relevant comments below.
Cheers!
The main improvement you can have for writing int[] is to either;
increase the buffer size. The size is right for most stream, but file access can be faster with a larger buffer. This could yield a 10-20% improvement.
Use NIO and a direct buffer. This allows you to write 32-bit values without converting to bytes. This may yield a 5% improvement.
BTW: You should be able to write at least 10 million int values per second. With disk caching you increase this to 200 million per second.
Array is Serializable - can't you just use writer.writeObject(data);
? That's definitely going to be faster than individual writeInt
calls.
If you have other requirements on the output data format than retrieval into int[]
, that's a different question.
精彩评论