How to dump a boolean matrix in numpy?
I have a graph represented as a numpy boolean array (G.adj.dtype == bool). This is homework in writing my own graph library, so I can't use networkx. I want to dump it to a file so that I can fiddle with it, but for the life of me I can't work out how to make numpy dump it in a recoverable fashion.
I've tried G.adj.tofile, which wrote the graph correctly (ish) as one long line of True/False. But fromfile barfs on reading this, giving a 1x1 array, and loadtxt raises a ValueError: invalid literal for int. np.savetxt works but saves the matrix as a list of 0/1 floats, and loadtxt(..., dtype=bool) fails with the same ValueError.
Finally, I've tried networkx.from_numpy_matrix with networkx.write_dot, but that gave each edge [weight=True] in the dot source, which broke networkx.read_dot.
To save:
numpy.savetxt('arr.txt', G.adj, fmt='%s')
To recover:
G.adj = numpy.genfromtxt('arr.txt', dtype=bool)
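A quick way to convince yourself the round trip works is the sketch below. The 3x3 matrix and the temporary file path are made up for illustration; `adj` simply stands in for `G.adj`.

```python
import os
import tempfile

import numpy

# Hypothetical 3x3 adjacency matrix standing in for G.adj.
adj = numpy.array([[False, True, False],
                   [True, False, True],
                   [False, True, False]], dtype=bool)

path = os.path.join(tempfile.mkdtemp(), 'arr.txt')

# fmt='%s' writes each entry as the text "True" or "False".
numpy.savetxt(path, adj, fmt='%s')

# genfromtxt converts those strings back to booleans when dtype=bool.
restored = numpy.genfromtxt(path, dtype=bool)

round_trip_ok = (restored.dtype == bool
                 and restored.shape == adj.shape
                 and numpy.array_equal(restored, adj))
print(round_trip_ok)
```

Note that genfromtxt returns a one-dimensional array when the file has a single row or column, so very small graphs may need a reshape afterwards.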
HTH!
This is my test case:
obj = numpy.random.rand(100, 100) > 0.5
space efficiency
numpy.savetxt('arr.txt', obj, fmt='%s') creates a 54 kB file.
numpy.savetxt('arr.txt', obj, fmt='%d') creates a much smaller file (20 kB).
cPickle.dump(obj, open('arr.dump', 'w')) creates a 40 kB file.
time efficiency
numpy.savetxt('arr.txt', obj, fmt='%s'): 45 ms
numpy.savetxt('arr.txt', obj, fmt='%d'): 10 ms
cPickle.dump(obj, open('arr.dump', 'w')): 2.3 ms
conclusion
Use savetxt with the text format (%s) if human readability is needed, savetxt with the numeric format (%d) if space is a concern, and cPickle if time is an issue.
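To repeat this comparison on your own machine, something like the following sketch should do. It assumes Python 3, so `pickle` stands in for `cPickle`; the file names, the `methods` table and the `dump_pickle` helper are illustrative only. Expect the absolute numbers to differ from the ones quoted above.

```python
import os
import pickle
import tempfile
import timeit

import numpy

obj = numpy.random.rand(100, 100) > 0.5
tmpdir = tempfile.mkdtemp()


def dump_pickle(path):
    # Binary mode is required for pickle on Python 3.
    with open(path, 'wb') as f:
        pickle.dump(obj, f)


# Hypothetical labels mapped to (file path, function that writes the file).
methods = {
    'savetxt %s': (os.path.join(tmpdir, 'arr_s.txt'),
                   lambda p: numpy.savetxt(p, obj, fmt='%s')),
    'savetxt %d': (os.path.join(tmpdir, 'arr_d.txt'),
                   lambda p: numpy.savetxt(p, obj, fmt='%d')),
    'pickle': (os.path.join(tmpdir, 'arr.dump'), dump_pickle),
}

sizes = {}
for label, (path, write) in methods.items():
    # Average over 10 runs to smooth out noise.
    seconds = timeit.timeit(lambda: write(path), number=10) / 10
    sizes[label] = os.path.getsize(path)
    print('%-12s %8d bytes  %.2f ms' % (label, sizes[label], seconds * 1000))
```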
The easiest way to save your array including metadata (dtype, dimensions) is to use numpy.save() and numpy.load():
import numpy
a = numpy.array([[False,  True, False],
                 [ True, False,  True],
                 [False,  True, False],
                 [ True, False,  True],
                 [False,  True, False]], dtype=bool)
numpy.save("data.npy", a)
numpy.load("data.npy")
# array([[False,  True, False],
#        [ True, False,  True],
#        [False,  True, False],
#        [ True, False,  True],
#        [False,  True, False]], dtype=bool)
a.tofile() and numpy.fromfile() would work as well, but don't save any metadata. You need to pass dtype=bool to fromfile() and will get a one-dimensional array that must be reshape()d to its original shape.
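As a minimal sketch of that tofile()/fromfile() route (the small array and the temporary file name are invented for the example):

```python
import os
import tempfile

import numpy

a = numpy.array([[False, True, False],
                 [True, False, True]], dtype=bool)

path = os.path.join(tempfile.mkdtemp(), 'data.bin')

# tofile() with the default sep='' writes raw bytes: no dtype, no shape.
a.tofile(path)

# Both pieces of metadata have to be supplied again by hand.
flat = numpy.fromfile(path, dtype=bool)
b = flat.reshape(a.shape)

print(flat.shape)                 # (6,)
print(numpy.array_equal(a, b))    # True
```

Since the shape is not stored in the file, you have to keep track of it yourself, which is the main reason to prefer numpy.save().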
I know this question is quite old, but I want to add Python 3 benchmarks. The setup is a bit different from the previous one.
First I load a lot of data into memory, convert it to an int8 numpy array with only 0 and 1 as possible values, and then dump it to the HDD using two approaches.
# Note: timer is not a standard-library module; presumably a small
# helper whose Timer context manager exposes .interval in seconds.
from timer import Timer
import numpy
import pickle
import sys

# Load data part of code is omitted.
prime = int(sys.argv[1])
np_table = numpy.array(check_table, dtype=numpy.int8)
filename = "%d.dump" % prime

with Timer() as t:
    pickle.dump(np_table, open("dumps/pickle_" + filename, 'wb'))
print('pickle took %.03f sec.' % (t.interval))

with Timer() as t:
    numpy.savetxt("dumps/np_" + filename, np_table, fmt='%d')
print('savetxt took %.03f sec.' % (t.interval))
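The `timer` module imported above is not part of the standard library and its source isn't shown. If you want to rerun the benchmark, a stand-in along these lines should be enough. This is a guess at its behaviour, assuming only that `Timer` is a context manager with an `interval` attribute in seconds; the summing loop is just a placeholder workload.

```python
import time


class Timer:
    """Hypothetical stand-in for timer.Timer: measures a with-block."""

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.end = time.perf_counter()
        # Elapsed wall-clock time in seconds.
        self.interval = self.end - self.start
        return False  # do not swallow exceptions


with Timer() as t:
    total = sum(range(100000))
print('summing took %.03f sec.' % t.interval)
```

With this version `interval` is only set when the block exits, so it has to be read after the `with` statement rather than inside it.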
Time measuring
It took 50.700 sec to load data number 11
pickle took 0.010 sec.
savetxt took 1.930 sec.
It took 1297.970 sec to load data number 29
pickle took 0.070 sec.
savetxt took 242.590 sec.
It took 1583.380 sec to load data number 31
pickle took 0.090 sec.
savetxt took 334.740 sec.
It took 3855.840 sec to load data number 41
pickle took 0.610 sec.
savetxt took 1367.840 sec.
It took 4457.170 sec to load data number 43
pickle took 0.780 sec.
savetxt took 1654.050 sec.
It took 5792.480 sec to load data number 47
pickle took 1.160 sec.
savetxt took 2393.680 sec.
It took 8101.020 sec to load data number 53
pickle took 1.980 sec.
savetxt took 4397.080 sec.
Size measuring
630K np_11.dump
79M np_29.dump
110M np_31.dump
442M np_41.dump
561M np_43.dump
875M np_47.dump
1,6G np_53.dump
315K pickle_11.dump
40M pickle_29.dump
55M pickle_31.dump
221M pickle_41.dump
281M pickle_43.dump
438M pickle_47.dump
798M pickle_53.dump
So the Python 3 pickle version is much faster than numpy.savetxt and uses about half the disk space.