
Google Protocol Buffers, HDF5, NumPy comparison (transferring data)

I need help making a decision. I need to transfer some data in my application and have to choose between these three technologies. I've read a bit about all of them (tutorials, documentation) but still can't decide...

How do they compare?

I need metadata support (the ability to receive a file and read it without any additional information/files) and fast read/write operations; the ability to store dynamic data (like Python objects) would be a plus.

Things I already know:

  • NumPy is pretty fast but can't store dynamic data (like Python objects). (What about metadata?)
  • HDF5 is very fast, supports custom attributes, and is easy to use, but can't store Python objects. HDF5 also serializes NumPy data natively, so, IMHO, NumPy has no advantage over HDF5.
  • Google Protocol Buffers are self-describing too and pretty fast (though Python support is poor at present: slow and buggy). They CAN store dynamic data. Minuses: self-description doesn't work from Python, and messages >= 1 MB serialize/deserialize rather slowly.

PS: the data I need to transfer is the "result of work" of NumPy/SciPy (arrays, arrays of complicated structs, etc.)

UPD: cross-language access required (C/C++/Python)


There does seem to be a slight contradiction in your question - you want to be able to store Python objects, but you also want C/C++ access. I think that regardless of which choice you go with, you will need to convert your fancy Python data structures into more static structures such as arrays.
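
If your structures are regular enough, a NumPy structured array is often the natural "static" form. Here is a minimal sketch, with hypothetical field names, of flattening a list of dicts into a structured array that HDF5 or Protocol Buffers could then carry:

import numpy as np

# Hypothetical "result of work": a list of small Python dicts.
records = [
    {"id": 1, "freq": 2.5, "amp": 0.1 + 0.3j},
    {"id": 2, "freq": 3.7, "amp": 0.0 - 1.2j},
]

# A fixed layout that C/C++ code can also understand.
dtype = np.dtype([("id", "i4"), ("freq", "f8"), ("amp", "c16")])
arr = np.array([(r["id"], r["freq"], r["amp"]) for r in records], dtype=dtype)

print(arr["freq"])  # a plain float64 column, no Python objects involved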

If you need cross-language access, I would suggest using HDF5 as it is a file format which is specifically designed to be independent of language, operating system, system architecture (e.g. on loading it can convert between big-endian and little-endian automatically) and is specifically aimed at users doing scientific/numerical computing. I don't know much about Google Protocol Buffers, so I can't really comment too much on that.

If you decide to go with HDF5, I would also recommend that you use h5py instead of pytables. This is because pytables creates HDF5 files with a whole lot of extra pythonic metadata which makes reading the data in C/C++ a bit more of a pain, whereas h5py doesn't create any of these extras. You can find a comparison here, and they also give a link to the pytables FAQ for their view on the matter so you can decide which suits your needs best.
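
To give a feel for it, here is a minimal h5py sketch (the file and attribute names are just examples): the custom attributes carry your metadata, and the resulting file is readable with the plain C HDF5 API.

import numpy as np
import h5py

data = np.random.rand(1000, 3)  # e.g. a NumPy/SciPy result

with h5py.File("results.h5", "w") as f:
    dset = f.create_dataset("measurements", data=data)
    dset.attrs["units"] = "volts"            # custom attributes = metadata
    dset.attrs["description"] = "example run"

with h5py.File("results.h5", "r") as f:
    print(f["measurements"].attrs["units"], f["measurements"].shape)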

Another format which is very similar to HDF5 is NetCDF. This also has Python bindings, however I have no experience in using this format so I cannot really comment beyond pointing out that it exists and is also widely used in scientific computing.


I don't know about HDF5, but you can store Python objects in NumPy arrays; you just lose all the important functionality, since C-level operations can no longer be performed on the array.

In [16]: import numpy as np
In [17]: x = np.zeros(10, dtype=object)
In [18]: x[3] = {'pants', 10}
In [19]: x
Out[19]: array([0, 0, 0, set([10, 'pants']), 0, 0, 0, 0, 0, 0], dtype=object)
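
Note, though, that such an object array only round-trips through NumPy's own pickle-based .npy format, which is Python-only. A minimal sketch of what that looks like, and why it doesn't help with the C/C++ requirement:

import numpy as np

x = np.zeros(10, dtype=object)
x[3] = {'pants', 10}

# Object arrays can only be saved/loaded with pickling enabled,
# so the resulting .npy file is not readable from C/C++.
np.save('x.npy', x, allow_pickle=True)
y = np.load('x.npy', allow_pickle=True)
print(y[3])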
