Python data structure for sparse list of x/y co-ordinates
Consider a list of x/y co-ordinates and a byte 'count'. x and y will each have a range of perhaps 0 to 5000, which gives 25 million cells.
However the data will be quite sparsely populated, there will be at most a few thousand entries and the majority of co-ordinates will have zero entries.
The structure will occasionally be looked up or added to (e.g. if there is something at x=5 and y=10 then increment it), but more frequently it will be converted into a list of x/y/count triples (sorting is not important).
The fastest data structure for lookup is obviously a 2d array, but you're looking at 24 MB of memory or so, and iterating over it to output a list could be expensive. For disk storage you could implement GIF-style run-length compression, where a 0 byte followed by another byte indicates that many empty cells and anything else is a cell value - but this doesn't help the memory situation.
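The run-length idea above can be sketched as follows. This is a minimal illustration, not a full GIF codec: it assumes a row of byte-sized cell values, flattened one row at a time, with zero runs capped at 255 so the count fits in a byte.

```python
def rle_encode(row):
    """Encode a row of cell values: a 0 byte followed by a count byte
    stands for that many empty cells; non-zero values pass through."""
    out = []
    i = 0
    while i < len(row):
        if row[i] == 0:
            run = 0
            while i < len(row) and row[i] == 0 and run < 255:
                run += 1
                i += 1
            out.extend([0, run])
        else:
            out.append(row[i])
            i += 1
    return out

def rle_decode(data):
    """Expand the encoding back into the original row of cell values."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == 0:
            out.extend([0] * data[i + 1])
            i += 2
        else:
            out.append(data[i])
            i += 1
    return out
```

As the question notes, this compresses well on disk for sparse rows but does nothing for in-memory lookup cost.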
A dictionary of dictionaries would probably be a good balance between lookup/iteration speed and memory usage.
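A rough sketch of that dict-of-dicts approach, assuming an outer dict keyed by x and an inner dict keyed by y (the helper names here are just for illustration):

```python
from collections import defaultdict

# Outer dict keyed by x; each value is an inner dict mapping y -> count.
grid = defaultdict(dict)

def increment(x, y):
    # Only populated cells consume memory.
    grid[x][y] = grid[x].get(y, 0) + 1

def to_triples():
    # Flatten into the x/y/count list the question asks for.
    return [(x, y, count)
            for x, row in grid.items()
            for y, count in row.items()]

increment(5, 10)
increment(5, 10)
```

Memory scales with the number of populated cells plus one inner dict per distinct x, rather than with the full 5000x5000 grid.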
Are there any other suitable data structures I should be considering (either built in to Python, existing libraries, or more general data structures)?
A dictionary keyed by a point (i.e. a 2-tuple) sounds good to me. It's O(1) like an array, and significantly more compact. As long as you never need to do range queries or the like, it should be fine.
# increment
p = (x, y)
counts[p] = counts.get(p, 0) + 1

# list
for (x, y), count in counts.items():
    print(x, y, count)
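The same pattern can be written a little more compactly with collections.Counter, which returns 0 for missing keys, so the increment needs no .get():

```python
from collections import Counter

# Counter is a dict subclass; missing keys count as zero.
counts = Counter()
counts[(5, 10)] += 1
counts[(5, 10)] += 1
counts[(7, 3)] += 1

# Same iteration as with a plain dict.
for (x, y), count in counts.items():
    print(x, y, count)
```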
SciPy has a range of different sparse arrays.
There are seven available sparse matrix types:
csc_matrix: Compressed Sparse Column format
csr_matrix: Compressed Sparse Row format
bsr_matrix: Block Sparse Row format
lil_matrix: List of Lists format
dok_matrix: Dictionary of Keys format
coo_matrix: COOrdinate format (aka IJV, triplet format)
dia_matrix: DIAgonal format
Working with these should be similar to working with sparse matrices at the size of your data range; there's plenty of stuff to chew on here: http://en.wikipedia.org/wiki/Sparse_matrix
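Of the formats listed, dok_matrix (Dictionary of Keys) is probably the closest fit for this access pattern, since it supports efficient incremental updates. A small sketch, assuming scipy is installed:

```python
from scipy.sparse import dok_matrix

# Memory scales with the number of non-zero entries,
# not the full 5001 x 5001 grid.
grid = dok_matrix((5001, 5001), dtype=int)

grid[5, 10] += 1
grid[5, 10] += 1

# Convert to COO (triplet) format to read out x/y/count triples.
coo = grid.tocoo()
triples = list(zip(coo.row, coo.col, coo.data))
```

For a plain counting problem a dict keyed by (x, y) is simpler, but the SciPy route pays off if you later need matrix operations or conversion to CSR/CSC for fast arithmetic.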