efficient way to compress a numpy array (python)
I am looking for an efficient way to compress a numpy array.
I have an array like: dtype=[('name', (np.str_, 8)), ('job', (np.str_, 8)), ('income', np.uint32)]
(my favourite example).
If I do something like this: my_array.compress(my_array['income'] > 10000)
I get a new array with only incomes > 10000, and it's quite quick.
But if I want to filter on jobs in a list, it doesn't work:
my_array.compress(my_array['job'] in ['this', 'that'])
Error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
So I have to do something like this:
np.array([x for x in my_array if x['job'] in ['this', 'that']])
This is both ugly and inefficient!
Do you have an idea to make it efficient?
It's not quite as nice as what you'd like, but I think you can do:
mask = my_array['job'] == 'this'
for condition in ['that', 'other']:
    mask = np.logical_or(mask, my_array['job'] == condition)
selected_array = my_array[mask]
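The loop of logical_or calls above can probably be collapsed into a single membership test. The following is a minimal sketch, assuming a NumPy version that provides np.isin (1.13 or later); the sample rows and the name rich_selected are made up for illustration and do not come from the question.

```python
import numpy as np

# Hypothetical sample data using the dtype from the question
dtype = [('name', (np.str_, 8)), ('job', (np.str_, 8)), ('income', np.uint32)]
my_array = np.array(
    [('ann', 'this', 12000),
     ('bob', 'that', 8000),
     ('cat', 'other', 15000),
     ('dan', 'none', 20000)],
    dtype=dtype,
)

# Elementwise membership test: True where 'job' is one of the wanted values
mask = np.isin(my_array['job'], ['this', 'that'])
selected_array = my_array[mask]

# Boolean masks combine with & and |, so the income filter can be added too
rich_selected = my_array[mask & (my_array['income'] > 10000)]
```

Because the mask is an ordinary boolean array, it can also be passed to my_array.compress(mask) if you prefer that spelling.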
The best way to compress a numpy array is to use PyTables. It is the de facto standard when it comes to handling large amounts of numerical data.
import tables as t
hdf5_file = t.open_file('outfile.hdf5', mode='w')  # openFile in PyTables 2.x
hdf5_file.create_array ......
hdf5_file.close()
If you're looking for a numpy-only solution, I don't think you'll get it. Still, although it does lots of work under the covers, consider whether the tabular package might be able to do what you want in a less "ugly" fashion. I'm not sure you'll get more "efficient" without writing a C extension yourself.
By the way, I think this is both efficient enough and pretty enough for just about any real case.
my_array.compress([x in ['this', 'that'] for x in my_array['job']])
As an extra step in making this less ugly and more efficient, you would presumably not have a hardcoded list in the middle, so I would use a set instead, as it's much faster to search than a list if the list has more than a few items:
job_set = set(['this', 'that'])
my_array.compress([x in job_set for x in my_array['job']])
If you don't think this is efficient enough, I'd advise benchmarking so you'll have confidence that you're spending your time wisely as you try to make it even more efficient.
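A benchmark along those lines could look like the sketch below. It uses the standard timeit module on made-up data (100,000 rows, four job names) and compares the set-based comprehension with np.isin, which is not mentioned above and is included only as an assumed alternative. The function names with_set and with_isin are hypothetical, and timings will vary by machine and NumPy version.

```python
import timeit
import numpy as np

# Hypothetical data: 100,000 rows with jobs drawn from a small pool
rng = np.random.default_rng(0)
jobs = rng.choice(['this', 'that', 'other', 'none'], size=100_000)
my_array = np.zeros(jobs.size, dtype=[('job', (np.str_, 8)), ('income', np.uint32)])
my_array['job'] = jobs

job_set = {'this', 'that'}

def with_set():
    # Python-level loop with a set lookup per element
    return my_array.compress([x in job_set for x in my_array['job']])

def with_isin():
    # Vectorized membership test (assumed alternative, NumPy >= 1.13)
    return my_array[np.isin(my_array['job'], list(job_set))]

# Sanity check: both approaches should select the same rows
assert with_set().tolist() == with_isin().tolist()

print('set comprehension:', timeit.timeit(with_set, number=5))
print('np.isin:          ', timeit.timeit(with_isin, number=5))
```

Run it on data shaped like your own before deciding whether further optimization is worth the effort.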