
Appending a large amount of data to a PyTables (HDF5) database where database.numcols != newdata.numcols?

I am trying to append a large dataset (>30 GB) to an existing PyTables table. The table has N columns, and the dataset has N-1 columns; one column is calculated after I know the other N-1 columns.

I'm using numpy.fromfile() to read chunks of the dataset into memory before appending it to the database. Ideally, I'd like to stick the data into the database, then calculate the final column, and finish up by using Table.modifyColumn() to complete the operation.

I've considered appending numpy.zeros((len(new_data), N)) to the table, then using Table.modifyColumns() to fill in the new data, but I'm hopeful someone knows a nice way to avoid generating a huge array of empty data for each chunk that I need to append.
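
For concreteness, here's a rough sketch of that considered approach (the file, table, and column names and compute_derived() are made up; modify_column() is the snake_case spelling of modifyColumn()):

import numpy
import tables

N = 8  # total number of columns in the table
chunk = numpy.fromfile("chunk.bin", dtype=numpy.float64).reshape(-1, N - 1)

with tables.open_file("data.h5", mode="a") as h5:
    table = h5.root.data
    start = table.nrows
    # The throwaway buffer I'd like to avoid allocating for every chunk
    padded = numpy.zeros((len(chunk), N))
    padded[:, :N - 1] = chunk
    table.append(padded)
    # Fill in the computed column afterwards; compute_derived() stands in
    # for the real calculation
    table.modify_column(start=start, stop=start + len(chunk),
                        colname="derived", column=compute_derived(chunk))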


If the columns are all the same type, you can use numpy.lib.stride_tricks.as_strided to make the (L, N-1) array you read from the file look like an array of shape (L, N). For example,

In [5]: a = numpy.arange(12).reshape(4,3)

In [6]: a
Out[6]: 
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [7]: a.strides
Out[7]: (24, 8)

In [8]: b = numpy.lib.stride_tricks.as_strided(a, shape=(4, 4), strides=(24, 8))

In [9]: b
Out[9]: 
array([[  0,   1,   2,   3],
       [  3,   4,   5,   6],
       [  6,   7,   8,   9],
       [  9,  10,  11, 112]])

Now you can use this array b to fill up the table. The last column of each row is the same as the first column of the next row (the final entry of the last row reads one element past the end of the buffer, which is why it shows a garbage value), but you'll overwrite those entries once you can compute the real values.

This won't work if a is a record array (i.e., has a compound dtype). For that, you can try numpy.lib.recfunctions.append_fields. Since it copies the data to a new array, it won't save you any significant amount of memory, but it does let you do all the writing at once.
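
For example, a minimal sketch with made-up field names (your real dtype and final-column calculation would replace these):

import numpy
from numpy.lib import recfunctions

# A chunk read from the file, with a compound dtype of N-1 fields
chunk = numpy.zeros(5, dtype=[("x", "f8"), ("y", "f8")])
# Copy it into a new array that also carries the final field
full = recfunctions.append_fields(chunk, "z", numpy.empty(len(chunk)),
                                  dtypes="f8", usemask=False)
full["z"] = full["x"] * full["y"]  # stand-in for the real calculation
# full now matches the table's N-column dtype and can go straight to Table.append()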


You could add the results to another table. Unless there's some compelling reason for the calculated column to be adjacent to the other columns, that's probably the easiest option. There's also something to be said for separating raw data from calculations anyway.
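
For example (a rough sketch; the file and node names and compute_result() are placeholders):

import tables

with tables.open_file("data.h5", mode="a") as h5:
    raw = h5.root.raw  # the existing table of N-1 raw columns
    calc = h5.create_table("/", "calculated",
                           {"result": tables.Float64Col()},
                           expectedrows=raw.nrows)
    row = calc.row
    chunksize = 100000
    for start in range(0, raw.nrows, chunksize):
        block = raw.read(start, stop=start + chunksize)  # structured array of raw rows
        for value in compute_result(block):  # one computed value per raw row
            row["result"] = value
            row.append()
    calc.flush()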

If you must increase the size of the table, look into using h5py. It provides a more direct interface to the HDF5 file. Keep in mind that, depending on how the dataset was created in the HDF5 file, it may not be possible to simply append a column to the data. See section 1.2.4, "Dataspace", in http://www.hdfgroup.org/HDF5/doc/UG/03_DataModel.html for a discussion of the general data format. h5py supports resizing a dataset, but only if the underlying dataset was created as resizable.
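
For instance, a minimal h5py sketch (the names and sizes are made up); a dataset can only grow along axes that were declared resizable via maxshape when it was created:

import h5py
import numpy

with h5py.File("resizable.h5", "w") as f:
    # maxshape=(None, None) makes both axes resizable (and forces chunked storage)
    dset = f.create_dataset("data", shape=(1000, 3), maxshape=(None, None), dtype="f8")
    dset[:] = numpy.random.random((1000, 3))  # the N-1 raw columns
    # Grow the dataset by one column, then fill it with the computed values
    dset.resize(4, axis=1)
    dset[:, 3] = dset[:, 0] + dset[:, 1]  # stand-in for the real calculation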

You could also use a single buffer to store the input data like so:

z = numpy.zeros((nrows, N))
while more_data_in_file:
    # Read a data block (fromfile returns a flat array, so reshape it to nrows x (N-1))
    z[:, :N-1] = numpy.fromfile('your_params', count=nrows * (N-1)).reshape(nrows, N-1)
    # Set the final column
    z[:, N-1] = f(z[:, :N-1])
    # Append the buffer to the table
    tables_handle.append(z)