Efficient way to compress a numpy array (Python)

I am looking for an efficient way to compress a numpy array. I have an array with a dtype like: dtype=[('name', (np.str_, 8)), ('job', (np.str_, 8)), ('income', np.uint32)] (my favourite example).

If I do something like my_array.compress(my_array['income'] > 10000), I get a new array containing only the records with income > 10000, and it's quite quick.

But if I want to keep only the records whose job is in a list, it doesn't work!

my_array.compress(my_array['job'] in ['this', 'that'])

Error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

So I have to do something like this:

np.array([x for x in my_array if x['job'] in ['this', 'that']])

This is both ugly and inefficient!

Do you have any idea how to make this efficient?


It's not quite as nice as what you'd like, but I think you can do:

# start with the first job, then OR in one boolean mask per remaining job
mask = my_array['job'] == 'this'
for condition in ['that', 'other']:
    mask = np.logical_or(mask, my_array['job'] == condition)
selected_array = my_array[mask]
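
For what it's worth, the loop above can probably be collapsed into a single call. The sketch below is not from the original thread: it assumes NumPy 1.13 or newer (where np.isin is available) and uses made-up sample records with the dtype from the question.

```python
import numpy as np

# Hypothetical sample data using the dtype from the question.
dtype = [('name', (np.str_, 8)), ('job', (np.str_, 8)), ('income', np.uint32)]
my_array = np.array(
    [('ann', 'this', 12000),
     ('bob', 'that', 8000),
     ('cid', 'other', 15000),
     ('dee', 'none', 9000)],
    dtype=dtype,
)

# np.isin tests each element of the first argument for membership in the
# second and returns a boolean array of the same shape.
mask = np.isin(my_array['job'], ['this', 'that', 'other'])
selected_array = my_array[mask]

# The OR-ed comparisons from the answer, for comparison.
loop_mask = my_array['job'] == 'this'
for condition in ['that', 'other']:
    loop_mask = np.logical_or(loop_mask, my_array['job'] == condition)
```

Both masks should select the same records; the np.isin version just avoids writing out one comparison per job.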


The best way to compress a numpy array is to use PyTables. It is the de facto standard when it comes to handling large amounts of numerical data.

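import tables as t
# open_file/create_array are the PyTables 3.x names for the older openFile/createArray
hdf5_file = t.open_file('outfile.hdf5', mode='w')  # mode='w' creates the file for writing
hdf5_file.create_array ......
hdf5_file.close()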


If you're looking for a numpy-only solution, I don't think you'll get it. Still, although it does lots of work under the covers, consider whether the tabular package might be able to do what you want in a less "ugly" fashion. I'm not sure you'll get more "efficient" without writing a C extension yourself.

By the way, I think this is both efficient enough and pretty enough for just about any real case.

my_array.compress([x in ['this', 'that'] for x in my_array['job']])

As an extra step in making this less ugly and more efficient, you would presumably not have a hardcoded list in the middle, so I would use a set instead, as it's much faster to search than a list if the list has more than a few items:

job_set = {'this', 'that'}
my_array.compress([x in job_set for x in my_array['job']])

If you don't think this is efficient enough, I'd advise benchmarking so you'll have confidence that you're spending your time wisely as you try to make it even more efficient.
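
A rough way to run such a benchmark is sketched below. The data is randomly generated and purely illustrative, the helper names with_comprehension and with_isin are made up for this example, and np.isin (NumPy 1.13+) is included only as one possible alternative to compare against, not something the approach above depends on.

```python
import timeit

import numpy as np

# Hypothetical test data: 10,000 records with the dtype from the question.
dtype = [('name', (np.str_, 8)), ('job', (np.str_, 8)), ('income', np.uint32)]
rng = np.random.default_rng(0)
n = 10_000
my_array = np.zeros(n, dtype=dtype)
my_array['job'] = rng.choice(['this', 'that', 'other', 'none'], size=n)
my_array['income'] = rng.integers(0, 20_000, size=n)

job_set = {'this', 'that'}


def with_comprehension():
    # Set lookup per element, as in the answer above.
    return my_array.compress([x in job_set for x in my_array['job']])


def with_isin():
    # Vectorised membership test (an alternative, not from the answer).
    return my_array[np.isin(my_array['job'], list(job_set))]


result_a = with_comprehension()
result_b = with_isin()

# Time each variant over a few repetitions; absolute numbers depend on the machine.
t_a = timeit.timeit(with_comprehension, number=5)
t_b = timeit.timeit(with_isin, number=5)
print(f"comprehension: {t_a:.4f}s  np.isin: {t_b:.4f}s")
```

Whatever the timings turn out to be on your data, check first that both variants return the same records before comparing speed.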
