Most efficient merging of 2 text files.

So I have large (around 4 GB each) txt files in pairs, and I need to create a 3rd file that consists of the 2 files interleaved. The following equation presents it best:

3rdfile = (4 lines from file 1) + (4 lines from file 2), repeated until I hit the end of file 1 (both input files have the same length - this is by definition). Here is the code I'm using now, but it doesn't scale well on large files. I was wondering if there is a more efficient way to do this - would working with a memory-mapped file help? All ideas are welcome.

public static void mergeFastq(String forwardFile, String reverseFile, String outputFile) {

    try {
        BufferedReader inputReaderForward = new BufferedReader(new FileReader(forwardFile));
        BufferedReader inputReaderReverse = new BufferedReader(new FileReader(reverseFile));
        PrintWriter outputWriter = new PrintWriter(new FileWriter(outputFile, true));

        String forwardLine = null;
        System.out.println("Begin merging Fastq files");
        int readsMerge = 0;
        while ((forwardLine = inputReaderForward.readLine()) != null) {

            //append the forward file
            outputWriter.println(forwardLine);
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            //append the reverse file
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());

            readsMerge++;
            if(readsMerge % 10000 == 0) {
                System.out.println("[" + now() + "] Merged 10000");
                readsMerge = 0;
            }

        }

        inputReaderForward.close();
        inputReaderReverse.close();
        outputWriter.close();

    } catch (IOException ex) {
        Logger.getLogger(Utilities.class.getName()).log(Level.SEVERE, "Error while merging FastQ files", ex);
    }
}


Maybe you also want to try using a BufferedWriter to cut down on your file I/O operations: http://download.oracle.com/javase/6/docs/api/java/io/BufferedWriter.html
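To illustrate the suggestion, here is a minimal sketch of how the writer in the question could be wrapped. The file name is a placeholder; only the extra BufferedWriter layer differs from the original code:

```java
import java.io.*;

public class BufferedWriterSketch {
    public static void main(String[] args) throws IOException {
        String outputFile = "merged.fastq"; // hypothetical path, for illustration

        // The BufferedWriter collects the many small println calls into
        // larger chunks, so far fewer writes hit the underlying file.
        PrintWriter outputWriter = new PrintWriter(
                new BufferedWriter(new FileWriter(outputFile, true)));

        outputWriter.println("@read1");
        outputWriter.println("ACGT");

        outputWriter.close(); // flushes the buffer and releases the file
        System.out.println(new File(outputFile).exists());
    }
}
```

Each println still looks the same at the call site; only the construction of the writer changes.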


A simple answer is to use a bigger buffer, which helps reduce the total number of I/O calls being made.

Usually, memory-mapped I/O with FileChannel (see Java NIO) is what gets used for large-file I/O. Here, however, it does not apply directly, because you need to inspect the file content to determine the boundary of every 4 lines.
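As a sketch of the "bigger buffer" idea: both BufferedReader and BufferedWriter accept an explicit buffer size (the default is 8 KB). The 1 MB size and file name below are illustrative assumptions, not tuned values:

```java
import java.io.*;

public class BigBufferSketch {
    public static void main(String[] args) throws IOException {
        int bufSize = 1 << 20; // 1 MB buffer instead of the 8 KB default

        // Larger buffers mean fewer underlying read()/write() system calls,
        // at the cost of more heap memory per stream.
        try (BufferedWriter out = new BufferedWriter(
                new FileWriter("bigbuf-demo.txt"), bufSize)) {
            out.write("example line");
            out.newLine();
        }
        try (BufferedReader in = new BufferedReader(
                new FileReader("bigbuf-demo.txt"), bufSize)) {
            System.out.println(in.readLine());
        }
    }
}
```

With two 4 GB inputs, a few megabytes of buffer per stream is a cheap trade for drastically fewer I/O calls.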


If performance were the main requirement, I would code this function in C or C++ instead of Java.

But regardless of the language used, what I would do is manage the memory myself. I would create two large buffers, say 128 MB or more each, and fill them with data from the two text files. Then you need a 3rd buffer that is twice as big as the other two. The algorithm moves characters one by one from input buffer #1 to the destination buffer, counting EOLs as it goes. Once you reach the 4th line, you record the current position in that buffer and repeat the same process with the 2nd input buffer. You continue alternating between the two input buffers, refilling them whenever their data is consumed. Each time you refill the input buffers, you can also write out the destination buffer and empty it.
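The alternating, EOL-counting part of that scheme can be sketched compactly. This is a simplified in-memory stand-in (using Readers over strings instead of manually refilled 128 MB buffers), meant only to show the character-by-character alternation; all names and the sample data are made up:

```java
import java.io.*;

public class AlternatingMergeSketch {
    // Moves characters one by one from 'in' to 'out', counting EOLs,
    // and stops after the 4th newline - the core loop of the scheme above.
    // Returns false when the input had nothing left to copy.
    static boolean copyFourLines(Reader in, Writer out) throws IOException {
        int newlines = 0, c;
        boolean wrote = false;
        while (newlines < 4 && (c = in.read()) != -1) {
            out.write(c);
            wrote = true;
            if (c == '\n') newlines++;
        }
        return wrote;
    }

    public static void main(String[] args) throws IOException {
        Reader f1 = new StringReader("a1\na2\na3\na4\na5\na6\na7\na8\n");
        Reader f2 = new StringReader("b1\nb2\nb3\nb4\nb5\nb6\nb7\nb8\n");
        StringWriter merged = new StringWriter();

        // Alternate: 4 lines from input 1, then 4 lines from input 2,
        // until input 1 is exhausted.
        while (copyFourLines(f1, merged)) {
            copyFourLines(f2, merged);
        }
        System.out.print(merged);
    }
}
```

In the real version, f1 and f2 would be the large refillable input buffers and merged the destination buffer that gets flushed to disk on each refill cycle.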


Buffer your read and write operations. The buffer needs to be large enough to minimize the read/write operations while still being memory efficient. This is really simple, and it works.

void write(InputStream is, OutputStream os) throws IOException {
    byte[] buf = new byte[102400]; // tune the buffer size to your needs
    int num;
    while ((num = is.read(buf)) != -1) {
        os.write(buf, 0, num);
    }
}

EDIT: I just realized that you need to shuffle the lines, so this code will not work for you as-is, but the concept still remains the same.
