
What is the efficient way to process large text files?

I have two files:

1- with 1,400,000 lines or records --- 14 MB

2- with 16,000,000 lines -- 170 MB

I want to find out whether each record or line in file 1 is also in file 2 or not.

I developed a Java app that does the following: it reads file 1 line by line and passes each line to a method that loops over file 2.

Here is my code:

public boolean hasIDin(String bioid) throws Exception {
    BufferedReader br = new BufferedReader(new FileReader("C://AllIDs.txt"));
    long bid = Long.parseLong(bioid);
    String thisLine;
    while ((thisLine = br.readLine()) != null) {
        if (Long.parseLong(thisLine) == bid)
            return true;
    }
    return false;
}

public void getMBD() throws Exception {
    BufferedReader br = new BufferedReader(new FileReader("C://DIDs.txt"));
    OutputStream os = new FileOutputStream("C://MBD.txt");
    PrintWriter pr = new PrintWriter(os);
    String thisLine;
    int count = 1;
    while ((thisLine = br.readLine()) != null) {
        String bioid = thisLine;
        System.out.println(count);
        if (!hasIDin(bioid))
            pr.println(bioid);
        count++;
    }
    pr.close();
}

When I run it, it looks like it will take more than 1,944 hours to complete, since processing each line takes about 5 seconds. That is about three months!

Are there any ideas to get it done in much, much less time?

Thanks in advance.


Why don't you:

  • read all the lines in file 1 into a set. A Set is fine, but a TLongHashSet would be more efficient.
  • for each line in file 2, see if it is in the set.

Here is a tuned implementation which prints the following and uses < 64 MB.

Generating 1400000 ids to /tmp/DID.txt
Generating 16000000 ids to /tmp/AllIDs.txt
Reading ids in /tmp/DID.txt
Reading ids in /tmp/AllIDs.txt
Took 8794 ms to find 294330 valid ids

Code

// Uses TLongHashSet from the GNU Trove library (Trove 2.x import shown;
// in Trove 3.x the class lives in gnu.trove.set.hash.TLongHashSet).
import gnu.trove.TLongHashSet;

import java.io.*;
import java.util.Random;

public static void main(String... args) throws IOException {
    generateFile("/tmp/DID.txt", 1400000);
    generateFile("/tmp/AllIDs.txt", 16000000);

    long start = System.currentTimeMillis();
    TLongHashSet did = readLongs("/tmp/DID.txt");
    TLongHashSet validIDS = readLongsUnion("/tmp/AllIDs.txt",did);

    long time = System.currentTimeMillis() - start;
    System.out.println("Took "+ time+" ms to find "+ validIDS.size()+" valid ids");
}

// Loads every id in the file into an in-memory TLongHashSet in one pass, with no boxing.
private static TLongHashSet readLongs(String filename) throws IOException {
    System.out.println("Reading ids in "+filename);
    BufferedReader br = new BufferedReader(new FileReader(filename), 128*1024);
    TLongHashSet ids = new TLongHashSet();
    for(String line; (line = br.readLine())!=null;)
        ids.add(Long.parseLong(line));
    br.close();
    return ids;
}

// Streams the file once and keeps only the ids that are already in validSet.
private static TLongHashSet readLongsUnion(String filename, TLongHashSet validSet) throws IOException {
    System.out.println("Reading ids in "+filename);
    BufferedReader br = new BufferedReader(new FileReader(filename), 128*1024);
    TLongHashSet ids = new TLongHashSet();
    for(String line; (line = br.readLine())!=null;) {
        long val = Long.parseLong(line);
        if (validSet.contains(val))
            ids.add(val);
    }
    br.close();
    return ids;
}

// Writes 'number' random ids, one per line, to build a test input file.
private static void generateFile(String filename, int number) throws IOException {
    System.out.println("Generating "+number+" ids to "+filename);
    PrintWriter pw = new PrintWriter(new BufferedWriter(new FileWriter(filename), 128*1024));
    Random rand = new Random();
    for(int i=0;i<number;i++)
        pw.println(rand.nextInt(1<<26));
    pw.close();
}


170 MB + 14 MB are not such huge files. My suggestion is to load the smaller file into a java.util.Map, parse the bigger file line by line (record by record), and check whether the current line is present in that Map.
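
A minimal sketch of that suggestion, assuming one numeric id per line; a java.util.HashSet is enough here because only membership matters, not a mapped value. The file names are the ones from the question.

import java.io.*;
import java.util.HashSet;
import java.util.Set;

public class FindMissingIds {
    public static void main(String[] args) throws IOException {
        Set<Long> ids = new HashSet<>();
        // Load every id from the small file (file 1) into the set.
        try (BufferedReader br = new BufferedReader(new FileReader("C://DIDs.txt"))) {
            for (String line; (line = br.readLine()) != null; )
                ids.add(Long.parseLong(line.trim()));
        }
        // Stream the big file (file 2) once; drop every id that exists there.
        try (BufferedReader br = new BufferedReader(new FileReader("C://AllIDs.txt"))) {
            for (String line; (line = br.readLine()) != null; )
                ids.remove(Long.parseLong(line.trim()));
        }
        // Whatever is left was in file 1 but never seen in file 2.
        try (PrintWriter pw = new PrintWriter(new FileWriter("C://MBD.txt"))) {
            for (long id : ids)
                pw.println(id);
        }
        System.out.println(ids.size() + " ids from file 1 were not found in file 2");
    }
}

Boxed Long objects cost more memory than a primitive collection such as Trove's TLongHashSet, but 1.4 million entries still fit comfortably in a default heap.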

P.S. The question looks like something trivial in terms of an RDBMS - maybe it's worth using one?


You can't do an O(N^2) algorithm when each iteration is so long; that's completely unacceptable.

If you have enough RAM, you simply parse file 1, create a map of all numbers, then parse file 2 and check your map.

If you don't have enough RAM, parse file 1, create a map and store it to a file, then parse file 2 and read the map. The key is to make the map as easy to parse as possible - make it a binary format, maybe with a binary tree or something where you can quickly skip around and search. (EDIT: I have to add Michael Borgwardt's Grace Hash Join link, which shows an even better way: http://en.wikipedia.org/wiki/Hash_join#Grace_hash_join)
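
A rough sketch of the Grace-hash-join idea applied to this problem, assuming one numeric id per line; the partition-file prefixes (did_part, all_part) and the partition count are made up for illustration. Matching ids always hash into the same partition index, so each partition pair can be joined with a small in-memory set even when the whole input would not fit in RAM.

import java.io.*;
import java.util.HashSet;
import java.util.Set;

public class GraceHashJoinSketch {
    static final int K = 16; // number of partitions; pick K so one partition fits in RAM

    // Phase 1: split an input file into K partition files by hash(id) % K,
    // so matching ids from both inputs land in the same partition index.
    static void partition(String input, String prefix) throws IOException {
        PrintWriter[] out = new PrintWriter[K];
        for (int i = 0; i < K; i++)
            out[i] = new PrintWriter(new BufferedWriter(new FileWriter(prefix + i + ".txt")));
        try (BufferedReader br = new BufferedReader(new FileReader(input))) {
            for (String line; (line = br.readLine()) != null; ) {
                long id = Long.parseLong(line.trim());
                out[(Long.hashCode(id) & 0x7fffffff) % K].println(id);
            }
        }
        for (PrintWriter pw : out)
            pw.close();
    }

    public static void main(String[] args) throws IOException {
        partition("C://DIDs.txt", "C://did_part");   // small file (file 1)
        partition("C://AllIDs.txt", "C://all_part"); // big file (file 2)

        // Phase 2: join each partition pair in memory; each pair is roughly 1/K of the input.
        try (PrintWriter result = new PrintWriter(new FileWriter("C://MBD.txt"))) {
            for (int i = 0; i < K; i++) {
                Set<Long> part = new HashSet<>();
                try (BufferedReader br = new BufferedReader(new FileReader("C://did_part" + i + ".txt"))) {
                    for (String line; (line = br.readLine()) != null; )
                        part.add(Long.parseLong(line));
                }
                try (BufferedReader br = new BufferedReader(new FileReader("C://all_part" + i + ".txt"))) {
                    for (String line; (line = br.readLine()) != null; )
                        part.remove(Long.parseLong(line)); // id exists in file 2, so it is not missing
                }
                for (long id : part)
                    result.println(id); // ids from file 1 that never appeared in file 2
            }
        }
    }
}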

If there is a limit to the size of your files, option 1 is easier to implement - unless you're dealing with huuuuuuuge files (I'm talking lots of GB), you definitely want to do that.


Usually, memory-mapping is the most efficient way to read large files. You'll need to use java.nio.MappedByteBuffer and java.io.RandomAccessFile.
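
For example, a minimal sketch of memory-mapping the big file with java.nio, assuming non-negative decimal ids, one per line, and a file under 2 GB (a single MappedByteBuffer cannot map more). It only counts the ids here; the same loop could feed them into a hash set instead.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedReadSketch {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("C://AllIDs.txt", "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long current = 0; // id being accumulated from the current line's digits
            int lines = 0;
            while (buf.hasRemaining()) {
                byte b = buf.get();
                if (b >= '0' && b <= '9') {
                    current = current * 10 + (b - '0');
                } else if (b == '\n') {
                    lines++;      // end of line: 'current' now holds one complete id
                    current = 0;
                }
            }
            System.out.println("Read " + lines + " ids");
        }
    }
}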

But your search algorithm is the real problem. Building some sort of index or hash table is what you need.
