Java: Filling in-memory sorted batches

So I'm using Java to do multi-way external merge sorts of large on-disk files of line-delimited tuples. Batches of tuples are read into a TreeSet, which is then dumped into a sorted on-disk batch. Once all of the data has been exhausted, these batches are merge-sorted to the output.

Currently I'm using magic numbers to figure out how many tuples we can fit into memory. This is based on a static figure indicating roughly how many tuples fit per MB of heap space, and on how much heap space is available, using:

long max = Runtime.getRuntime().maxMemory();
long used = Runtime.getRuntime().totalMemory();
long free = Runtime.getRuntime().freeMemory();      
long space = free + (max - used);

However, this does not always work so well since we may be sorting different length tuples (for which the static tuple-per-MB figure might be too conservative) and I now want to use flyweight patterns to jam more in there, which may make the figure even more variable.
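For concreteness, the current approach looks something like the following sketch (TUPLES_PER_MB and batchCapacity are hypothetical names for illustration, and the figure is just a placeholder):

```java
public class BatchSizer {
    // Hypothetical static tuples-per-megabyte figure (an assumption, not measured).
    static final int TUPLES_PER_MB = 5000;

    // Estimate how many tuples the next in-memory batch can hold,
    // based on free heap plus heap the JVM may still claim from the OS.
    static int batchCapacity() {
        Runtime rt = Runtime.getRuntime();
        long space = rt.freeMemory() + (rt.maxMemory() - rt.totalMemory());
        return (int) (space / (1024 * 1024)) * TUPLES_PER_MB;
    }
}
```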

So I'm looking for a better way to fill the heap-space to the brim. Ideally the solution should be:

  • reliable (no risk of heap-space exceptions)
  • flexible (not based on static numbers)
  • efficient (e.g., not polling runtime memory estimates after every tuple)

Any ideas?


Filling the heap to the brim might be a bad idea due to garbage-collector thrashing: as the heap gets nearly full, garbage-collection efficiency approaches zero, because the effort of a collection depends on heap size, while the amount of memory freed depends on the size of the objects identified as unreachable.

However, if you must, can't you simply do it as follows?

for (;;) {
    long freeSpace = getFreeSpace();
    if (freeSpace < 1000000) break;
    while (freeSpace > 0) {
        treeSet.add(readRecord());
        freeSpace -= MAX_RECORD_SIZE;
    }
}

The calls to discover the free memory will be rare, so shouldn't tax performance much. For instance, if you have 1 GB heap space, and leave 1MB empty, and MAX_RECORD_SIZE is ten times average record size, getFreeSpace() will be invoked a mere log(1000) / -log(0.9) ~= 66 times.
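The getFreeSpace() helper is left undefined above; a minimal sketch, assuming it should mirror the Runtime arithmetic from the question, might be:

```java
public class MemoryEstimator {
    // Hypothetical getFreeSpace() implementation (the answer leaves it undefined).
    // Counts both currently-unused heap and heap the JVM has not yet claimed from the OS.
    static long getFreeSpace() {
        Runtime rt = Runtime.getRuntime();
        return rt.freeMemory() + (rt.maxMemory() - rt.totalMemory());
    }
}
```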


Why bother with calculating how many items you can hold? How about letting Java tell you when you've used up all your memory, catching the OutOfMemoryError, and continuing? For example:

    // prepare output medium now so we don't need to worry about having enough 
    // memory once the treeset has been filled.
    BufferedWriter writer = new BufferedWriter(new FileWriter("output"));

    Set<Object> set = new TreeSet<Object>(); // element type should match what parseTuple returns
    int linesRead = 0;
    {
        BufferedReader reader = new BufferedReader(new FileReader("input"));
        try {
            String line = reader.readLine();
            while (line != null) {
                set.add(parseTuple(line));
                linesRead += 1;
                line = reader.readLine();
            }
            // end of file reached
            linesRead = -1;
        } catch (OutOfMemoryError e) {
            // while loop broken
        } finally {
            reader.close();
        }
        // since reader and line were declared in a block their resources will 
        // now be released 
    }

    // output treeset to file
    for (Object o: set) {
        writer.write(o.toString());
    }
    writer.close();

    // use linesRead to find position in file for next pass
    // or continue on to next file, depending on value of linesRead

If you still have trouble with memory, just make the reader's buffer extra large so as to reserve more memory.

The default buffer in a BufferedReader is 8192 characters (16 KB), so when you finish reading and release the reader you free at least that much memory. After this your additional memory needs are minimal. You need enough memory to create an iterator for the set; let's be generous and assume 200 bytes. You will also need memory to store the string output of your tuples (but only temporarily). You say the tuples contain about 200 characters; let's double that to account for separators: 400 characters, which is 800 bytes. So all you really need is an additional 1 KB, and you're fine, since you've just released 16 KB.

The reason you don't need to worry about the memory used to store the string output of your tuples is because they are short lived and only referred to within the output for loop. Note that the Writer will copy the contents into its buffer and then discard the string. Thus, the next time the garbage collector runs the memory can be reclaimed.

I've checked, and an OOME in add() will not leave a TreeSet in an inconsistent state: the memory allocation for a new Entry (the internal implementation for storing a key/value pair) happens before the internal representation is modified.


You can really fill the heap to the brim using direct memory writing (it does exist in Java!). It's in sun.misc.Unsafe, but isn't really recommended for use. See here for more details. I'd probably advise writing some JNI code instead, and using existing C++ algorithms.
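For reference, a minimal sketch of off-heap allocation via sun.misc.Unsafe might look like this (the singleton must be grabbed via reflection since the constructor is private; this is unsupported API, and ByteBuffer.allocateDirect is the sanctioned alternative):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffHeapDemo {
    public static void main(String[] args) throws Exception {
        // Unsafe's constructor is private; obtain the singleton via reflection.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long size = 8192;
        long address = unsafe.allocateMemory(size); // raw memory, invisible to the GC
        try {
            unsafe.putByte(address, (byte) 42);
            System.out.println(unsafe.getByte(address)); // prints 42
        } finally {
            unsafe.freeMemory(address); // must free manually, or it leaks
        }
    }
}
```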


I'll add this as an idea I was playing around with, involving using a SoftReference as a "sniffer" for low memory.

SoftReference<byte[]> sniffer = new SoftReference<byte[]>(new byte[8192]);
while (iter.hasNext()) {
   tuple = iter.next();
   treeset.add(tuple);
   if (sniffer.get() == null) {
      dump(treeset);
      treeset.clear();
      sniffer = new SoftReference<byte[]>(new byte[8192]);
   }
}

This might work well in theory, but I don't know the exact behaviour of SoftReference.

All soft references to softly-reachable objects are guaranteed to have been cleared before the virtual machine throws an OutOfMemoryError. Otherwise no constraints are placed upon the time at which a soft reference will be cleared or the order in which a set of such references to different objects will be cleared. Virtual machine implementations are, however, encouraged to bias against clearing recently-created or recently-used soft references.

Would like to hear feedback as it seems to me like an elegant solution, although behaviour might vary between VMs?

Testing on my laptop, I found that the soft reference is cleared infrequently, but sometimes too early, so I'm thinking of combining it with meriton's answer:

SoftReference<byte[]> sniffer = new SoftReference<byte[]>(new byte[8192]);
while (iter.hasNext()) {
   tuple = iter.next();
   treeset.add(tuple);
   if (sniffer.get() == null) {
      free = MemoryManager.estimateFreeSpace();
      if (free < MIN_SAFE_MEMORY) {
         dump(treeset);
         treeset.clear();
         sniffer = new SoftReference<byte[]>(new byte[8192]);
      }
   }
}

Again, thoughts welcome!
