Compressing measurement data files
From measurements I get text files that basically contain a table of float numbers with the dimensions 1000x1000. Those take up about 15 MB each which, considering that I get about 1000 result files in a series, is unacceptable to save. So I am trying to compress them by as much as possible without loss of data. My idea is to group the numbers into ~1000 steps over the range I expect and save those. That would provide sufficient resolution. However, I still have 1,000,000 points to consider, and thus my resulting file is still about 4 MB. I probably won't be able to compress that any further?

The bigger problem is the calculation time this takes. Right now I'd guesstimate 10-12 seconds per file, so about 3 hours for the 1000 files. WAAAAY too much. This is the algorithm I thought up; do you have any suggestions? There are probably far more efficient algorithms to do that, but I am not much of a programmer...
import numpy

# read the 1000x1000 table of floats
data = numpy.genfromtxt('sample.txt', autostrip=True, case_sensitive=True)
out = numpy.empty((1000, 1000), numpy.int16)

min = -0.5                 # expected lower bound of the data
max = 0.5                  # expected upper bound of the data
step = (max - min) / 1000  # width of one quantization bin

i = 0
while i <= 999:
    j = 0
    while j <= 999:
        # bin index by integer division
        k = data[i, j] // step
        out[i, j] = k
        # clamp out-of-range values to the extreme bins
        if data[i, j] > max:
            out[i, j] = 500
        if data[i, j] < min:
            out[i, j] = -500
        j = j + 1
    i = i + 1

numpy.savetxt('converted.txt', out, fmt="%i")
Thanks in advance for any hints you can provide! Jakob
I see you store the numpy arrays as text files. There is a faster and more space-efficient way: just dump them.
If your floats can be stored as 32-bit floats, then use this:
data = numpy.genfromtxt('sample.txt', autostrip=True, case_sensitive=True)
data.astype(numpy.float32).dump('converted.numpy')  # pickles the array to disk
then you can read it back with

data = numpy.load('converted.numpy', allow_pickle=True)  # recent numpy needs allow_pickle for dumped (pickled) arrays
The files will be 1000 x 1000 x 4 bytes, about 4 MB each.
The latest version of numpy supports 16-bit floats. Maybe your floats will fit in their limited range.
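If they do, a quick sanity check is to round-trip through float16 and inspect the worst-case rounding error; a sketch, with illustrative variable names:

import numpy

data = numpy.genfromtxt('sample.txt', autostrip=True, case_sensitive=True)
small = data.astype(numpy.float16)                          # half precision, ~3 significant digits
print(numpy.abs(small.astype(numpy.float64) - data).max())  # worst-case rounding error
small.dump('converted16.numpy')                             # 1000 x 1000 x 2 bytes, about 2 MB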
numpy.savez_compressed will let you save lots of arrays into a single compressed, binary file.
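For example, a sketch of bundling several result tables into one compressed archive; the file names and array keys here are made up for illustration:

import numpy

# collect a few result tables under illustrative keys
arrays = {'run%03d' % n: numpy.genfromtxt('sample%03d.txt' % n).astype(numpy.float32)
          for n in range(3)}
numpy.savez_compressed('results.npz', **arrays)

loaded = numpy.load('results.npz')
print(loaded['run000'].shape)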
However, you aren't going to be able to compress it that much: if you have 15 GB of data, you're not magically going to fit it into 200 MB with compression algorithms. You have to throw out some of your data, and only you can decide how much you need to keep.
Use the zipfile, bz2 or gzip module to save to a zip, bz2 or gz file from Python. Any compression scheme you write yourself in a reasonable amount of time will almost certainly be slower and have a worse compression ratio than these generic but optimized and compiled solutions. Also consider taking eumiro's advice.
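For instance, a sketch using the file names from the question: numpy.savetxt gzips automatically when the file name ends in .gz, and the gzip module can compress an already-written file:

import gzip
import shutil
import numpy

out = numpy.genfromtxt('converted.txt')           # the quantized table from the question
numpy.savetxt('converted.txt.gz', out, fmt='%i')  # .gz suffix -> savetxt writes gzip directly

# alternatively, compress the existing text file as-is
with open('converted.txt', 'rb') as src, gzip.open('converted2.txt.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)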