
Representing a 100K X 100K matrix in Java

How can I store a 100K X 100K matrix in Java?

I can't do that with a normal array declaration as it is throwing a java.lang.OutOfMemoryError.


The Colt library has a sparse matrix implementation for Java.
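
For illustration, a minimal sketch of what that might look like with Colt's SparseDoubleMatrix2D (check the Javadoc of your Colt version for the exact API):

import cern.colt.matrix.impl.SparseDoubleMatrix2D;

public class ColtSparseExample {
    public static void main(String[] args) {
        // Only cells that have been set consume memory; the rest read back as 0.0.
        SparseDoubleMatrix2D matrix = new SparseDoubleMatrix2D(100000, 100000);
        matrix.set(1000, 1001, 42.0);
        System.out.println(matrix.get(1000, 1001)); // 42.0
        System.out.println(matrix.get(1001, 1000)); // 0.0
    }
}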

You could alternatively use Berkeley DB as your storage engine.
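
If you go that route, the idea is to key each cell (or each row) by its index and let Berkeley DB keep the bytes on disk. A rough sketch against the Berkeley DB Java Edition API (com.sleepycat.je); the environment directory matrix-env is illustrative and must already exist:

import java.io.File;
import java.nio.ByteBuffer;
import com.sleepycat.je.*;

public class BdbMatrixStore {
    public static void main(String[] args) throws Exception {
        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        Environment env = new Environment(new File("matrix-env"), envCfg);

        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        Database db = env.openDatabase(null, "matrix", dbCfg);

        // Key a cell by its linear index, (long) row * 100000 + col, as 8 bytes.
        long index = 1000L * 100000 + 1001;
        DatabaseEntry key = new DatabaseEntry(ByteBuffer.allocate(8).putLong(index).array());
        DatabaseEntry val = new DatabaseEntry(ByteBuffer.allocate(8).putDouble(42.0).array());
        db.put(null, key, val);

        DatabaseEntry out = new DatabaseEntry();
        if (db.get(null, key, out, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            System.out.println(ByteBuffer.wrap(out.getData()).getDouble()); // 42.0
        }
        db.close();
        env.close();
    }
}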

Now if your machine has enough actual RAM (at least 9 GB free for single-byte entries, closer to 80 GB for doubles), you can increase the maximum heap size on the Java command line.
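
For example, to allow a 12 GB heap (the size and YourMatrixApp are placeholders; pick values to suit your element type and machine):

java -Xmx12g YourMatrixApp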


If the vast majority of entries in your matrix will be zero (or even some other constant value) a sparse matrix will be suitable. Otherwise it might be possible to rewrite your algorithm so that the whole matrix doesn't exist simultaneously. You could produce and consume one row at a time, for example.
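
A minimal sketch of that row-at-a-time pattern, where produceRow and consumeRow are placeholders for your own logic:

public class RowStreamExample {
    private static final int N = 100000;

    public static void main(String[] args) {
        double[] row = new double[N]; // one reused buffer, ~800 KB
        for (int i = 0; i < N; i++) {
            produceRow(i, row);  // fill row i (your computation here)
            consumeRow(i, row);  // use it before the next iteration overwrites it
        }
    }

    private static void produceRow(int i, double[] row) {
        for (int j = 0; j < row.length; j++) row[j] = 0.0; // placeholder
    }

    private static void consumeRow(int i, double[] row) {
        // e.g. accumulate a sum, write the row to disk, etc.
    }
}

This way only one 100K-element row (about 800 KB for doubles) is ever in memory at a time.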


Sounds like you need a sparse matrix. Others have already suggested good 3rd-party implementations that may suit your needs...

Depending on your application, you could get away without a third-party matrix library by just using a Map as a backing store for your matrix data. Kind of...

import java.util.Map;
import java.util.TreeMap;

public class SparseMatrix<T> {
    private T defaultValue;
    private int m;
    private int n;
    // Keys are long because i * n + j overflows int for a 100K x 100K matrix.
    private Map<Long, T> data = new TreeMap<Long, T>();
    /// create a new matrix with m rows and n columns
    public SparseMatrix(int m, int n, T defaultValue) {
        this.m = m;
        this.n = n;
        this.defaultValue = defaultValue;
    }
    /// set value at [i,j] (row, col)
    public void setValueAt(int i, int j, T value) {
        checkBounds(i, j);
        data.put((long) i * n + j, value);
    }
    /// retrieve value at [i,j] (row, col)
    public T getValueAt(int i, int j) {
        checkBounds(i, j);
        T value = data.get((long) i * n + j);
        return value != null ? value : defaultValue;
    }
    private void checkBounds(int i, int j) {
        if (i >= m || j >= n || i < 0 || j < 0)
            throw new IllegalArgumentException(
                    "index (" + i + ", " + j + ") out of bounds");
    }
}

A simple test case illustrating the use of SparseMatrix would be:

import junit.framework.TestCase;

public class SparseMatrixTest extends TestCase {
    public void testMatrix() {
        SparseMatrix<Float> matrix =
            new SparseMatrix<Float>(100000, 100000, 0.0F);
        matrix.setValueAt(1000, 1001, 42.0F);
        assertTrue(matrix.getValueAt(1000, 1001) == 42.0F);
        assertTrue(matrix.getValueAt(1001, 1000) == 0.0F);
    }
}

This is not the most efficient way of doing it, because every non-default entry in the matrix is stored as an object (and every key as a boxed Long). Depending on the number of actual values you are expecting, the simplicity of this approach might trump integrating a 3rd-party solution (and possibly dealing with its license; again, depending on your situation).

Adding matrix operations like multiplication to the above SparseMatrix implementation should be straightforward (and is left as an exercise for the reader ;-)


100,000 x 100,000 = 10,000,000,000 (10 billion) entries. Even if you're storing single byte entries, that's still in the vicinity of 10 GB - does your machine even have that much physical memory, let alone have a will to allocate that much to a single process?

Chances are you're going to need to look into some kind of a way to only keep part of the matrix in memory at any given time, and the rest buffered on disk.


There are a number of possible solutions depending on how much memory you have, how sparse the matrix actually is, and what the access patterns are going to be.

If 100K * 100K * 8 bytes (roughly 80 GB for doubles) is less than the amount of physical memory on your machine available to the JVM, a simple non-sparse array is a viable solution.
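
As a quick sanity check, here is a small sketch comparing that figure against what the JVM will actually give you (Runtime.maxMemory() reports the effective -Xmx ceiling):

public class MemoryCheck {
    public static void main(String[] args) {
        long rows = 100000, cols = 100000;
        long needed = rows * cols * 8;                   // 8 bytes per double: ~80 GB
        long heapMax = Runtime.getRuntime().maxMemory(); // what -Xmx allows
        System.out.println("need " + (needed >> 30) + " GiB, heap allows "
                + (heapMax >> 30) + " GiB");
    }
}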

If the array is sparse, with (say) 75% or more of the elements being zero, then you can save space by using a sparse array library. Various alternatives have been suggested, but in all cases, you still need to work out if this is going to give you enough savings. Figure out how many non-zero elements there are going to be, multiply that by 8 bytes (for doubles) and then by (say) 4 to account for the overheads of the sparse representation. If that is less than the amount of physical memory that you can make available to the JVM, then sparse arrays are a viable solution.

If sparse and non-sparse arrays (in memory) won't work, things will get more complicated, and the viability of any solution will depend on the access patterns for the array data.

  • One approach is to represent the array as a file that is mapped into memory in the form of a MappedByteBuffer (a minimal sketch follows this list). Assuming that you don't have enough physical memory to store the entire file in memory, you are going to be hitting the virtual memory system hard. So it is best if your algorithm only needs to operate on contiguous sections of the array at any time. Otherwise, you'll probably die from swapping.

  • A second approach is a variation of the first. Map the array/file a section at a time, and when you are done, unmap and move to the next section. This only works if the algorithm works on the array in sections.

  • A third approach is to represent the array using a light-weight database like BDB. This will be slower than any in-memory solution because reading array elements will translate into disc accesses. But if you get it wrong it won't kill the system like the memory mapped approach will. (And if you do this on Linux/Unix, the system's disc block cache may speed things up, depending on your algorithm's array access patterns)

  • A fourth approach is to use a distributed memory cache. This replaces disc i/o with network i/o, and it is hard to say whether this is a good or bad thing.

  • A fifth approach is to analyze your algorithm and see if it is amenable to implementing as a distributed algorithm; e.g. with sections of the array and corresponding parts of the algorithm on different machines.
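
To make the memory-mapped approach concrete, here is a minimal sketch (the file name and tile size are illustrative; note that a single MappedByteBuffer maps at most 2 GB, so an 80 GB matrix has to be split across several mappings/tiles):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedMatrixTile {
    public static void main(String[] args) throws Exception {
        int n = 10000;                  // one 10000 x 10000 tile of doubles, ~800 MB
        long bytes = (long) n * n * 8;
        RandomAccessFile file = new RandomAccessFile("matrix.bin", "rw");
        FileChannel channel = file.getChannel();
        // The OS pages the file in and out as needed; contiguous access is key.
        MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, bytes);

        // Element (i, j) lives at byte offset ((long) i * n + j) * 8 within the tile.
        int i = 1000, j = 1001;
        int offset = (int) (((long) i * n + j) * 8);
        buf.putDouble(offset, 42.0);
        System.out.println(buf.getDouble(offset)); // 42.0

        channel.close();
        file.close();
    }
}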


You can upgrade to this machine:

http://www.azulsystems.com/products/compute_appliance.htm

864 processor cores and 768 GB of memory; it only costs about as much as a single-family house somewhere.


Well, I'd suggest that you increase the memory in your JVM, but you're going to need a lot of it, as you're talking about 10 billion items. It's (barely) possible with lots of memory or a clustered JVM, but that's probably the wrong answer.

  • You're getting the OutOfMemoryError because when you declare an array such as int[1000], the memory is allocated immediately (additionally, doubles take up more space than ints, so an int representation will also save you space). Maybe you can substitute a more efficient implementation of your array (if you have many empty entries, look up "sparse matrix" representations).

  • You could store pieces in an outside system, like memcached or memory-mapped buffers.

There are lots of good suggestions here, maybe if you posted a more detailed description of the problem you're trying to solve people could be more specific.


You should try an "external" package to handle matrices. I've never done that myself, but maybe something like JAMA.


Unless you have 100K x 100K x 8 ~ 80GB of memory, you cannot create this matrix in memory. You can create this matrix on disk and access it using memory mapping. However, using this approach will be very slow.

What are you trying to do? You may find that representing your data in a different way will be much more efficient.
