What is a fast way to output an h5py dataset to text?

I am using the h5py Python package to read files in HDF5 format (e.g. somefile.h5). I would like to write the contents of a dataset to a text file.

For example, I would like to create a text file with the following contents: 1,20,31,75,142,324,78,12,3,90,8,21,1

I am able to access the dataset in Python using this code:

import h5py
f     = h5py.File('/Users/Me/Desktop/thefile.h5', 'r')
group = f['/level1/level2/level3']
dset  = group['dsetname']

My naive approach is too slow, because my dataset has over 20000 entries:

# write all values to file
# (txtfile is an already-open, writable text file)
for index in range(len(dset)):
    # do not add comma after last value
    if index == len(dset)-1: txtfile.write(repr(dset[index]))
    else:                    txtfile.write(repr(dset[index])+',')
txtfile.close()

Is there a faster way to write this to a file? Perhaps I could convert the dataset into a NumPy array or even a Python list, and then use some file-writing tool?

(I could experiment with concatenating the values into a larger string before writing to file, but I'm hoping there's something entirely more elegant)


Building a large string has the huge advantage of saving the need for the goofy "last-time switch" thanks to the excellent join method of strings: to replace your whole loop,

txtfile.write(','.join(repr(item) for item in dset))

I'm not sure how much more elegant you demand your code to be...;-)
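
For reference, here is a minimal sketch of the join approach as a complete function. The name `write_comma_separated` and the output path are made up for illustration, and a plain list stands in for the values read from the h5py dataset (for example via `dset[:]`). It uses `str` rather than `repr` on the assumption that the items may be NumPy scalars, whose `repr` includes the type name in NumPy 2.x.

```python
import os
import tempfile


def write_comma_separated(values, path):
    """Write the items of `values` to `path` as one comma-separated line."""
    # str.join places the separator only between items, so no special case
    # is needed for the last value.
    line = ",".join(str(item) for item in values)
    with open(path, "w") as txtfile:
        txtfile.write(line)
    return line


# Plain list standing in for the values read from the dataset (hypothetical data).
sample = [1, 20, 31, 75, 142]
out_path = os.path.join(tempfile.mkdtemp(), "dset.txt")
write_comma_separated(sample, out_path)
```

Building the string first also means the file is written in a single call instead of once per value.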


Your original suspicion was correct: first read the dataset into a NumPy array, then dump that array as text.

my_data = my_h5_group['dsetname'][()]   # whole dataset as a NumPy array (older h5py versions also offered .value)
my_data.tofile("my_data.txt", sep=",")  # a non-empty sep makes tofile write text instead of raw binary

This will be dramatically faster than iterating over the dataset object itself.
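
As a rough sketch of the NumPy route: `ndarray.tofile` writes raw binary unless a `sep` string is given, so text output needs something like `sep=","`. The helper name `dump_as_text`, the sample array, and the output path are invented here. With h5py the array would come from reading the whole dataset into memory first.

```python
import os
import tempfile

import numpy as np


def dump_as_text(array, path, sep=","):
    """Write a NumPy array to `path` as text, items separated by `sep`."""
    # A non-empty `sep` switches tofile from binary to text output;
    # multi-dimensional arrays are written flattened in C order.
    np.asarray(array).tofile(path, sep=sep)


# Sample integers standing in for the dataset contents (hypothetical data).
data = np.array([1, 20, 31, 75, 142], dtype=np.int64)
txt_path = os.path.join(tempfile.mkdtemp(), "my_data.txt")
dump_as_text(data, txt_path)
```

If one value per line or a specific number format is wanted, `np.savetxt` is an alternative worth considering.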


Maybe use h5dump on the HDF5 file?

I use this (bash):

(h5dump -y -o /dev/stderr -d $dataset $infile >$errorout) 2>&1 | sed -e 's/, /\n/g' -e 's/,$//' | sed 's/ //g' > $outfile 2> $errorout


I ran into the same thing and found a fix. Accessing elements directly on the open file, for example like this

print( hdf5['a'][i][j][k] )

is very slow, because every such access reads from the file again. Instead, read the dataset into memory once:

arr = hdf5['a'][:]      # outside the loop
print( arr[i][j][k] )   # inside the loop

This small change alone is enough to make it fast.
