Reading a binary file in Python: takes a very long time to read certain bytes
This is very odd
I'm reading some (admittedly very large: ~2GB each) binary files using the numpy library in Python. I'm using the:
thingy = np.fromfile(fileObject, np.int16, 1)
method. This is right in the middle of a nested loop - I'm doing this loop 4096 times per 'channel', and this 'channel' loop 9 times for every 'receiver', and this 'receiver' loop 4 times (there's 9 channels per receiver, of which there are 4!). This is for every 'block', of which there are ~3600 per file.
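For clarity, here's a minimal sketch of the loop structure (the file name is a placeholder):

import numpy as np

fileObject = open('data.bin', 'rb')      # placeholder file name
for block in range(3600):                # ~3600 blocks per file
    for receiver in range(4):            # 4 receivers
        for channel in range(9):         # 9 channels per receiver
            for sample in range(4096):   # 4096 samples per channel
                thingy = np.fromfile(fileObject, np.int16, 1)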
So as you can see, it's very iterative, and I know it will take a long time, but it was taking a LOT longer than I expected - on average 8.5 seconds per 'block'.
I ran some benchmarks using time.clock() etc. and found everything going as fast as it should be, except for approximately 1 or 2 samples per 'block' (so 1 or 2 in 4096*9*4) where it would seem to get 'stuck' for a few seconds. Now this should be a case of returning a simple int16 from binary, not exactly something that should be taking seconds... why is it sticking?
From the benchmarking I found it was sticking in the SAME place every time (block 2, receiver 8, channel 3, sample 1085 was one of them, for the record!), and it would get stuck there for approximately the same amount of time each run.
Any ideas?!
Thanks,
Duncan
Although it's hard to say without some kind of reproducible sample, this sounds like a buffering problem. The first part is buffered, and until you reach the end of the buffer, reads are fast; then everything slows down while the next buffer is filled, and so on.
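If that's the case, reading a whole channel in one call instead of 4096 one-sample calls should amortize the buffer refills - a rough sketch, assuming the 4096 samples of a channel are contiguous in the file (the file name is a placeholder):

import numpy as np

with open('data.bin', 'rb') as f:    # placeholder file name
    # one read per channel rather than 4096 reads of one int16 each
    channel_data = np.fromfile(f, np.int16, 4096)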
Where are you storing the results? When lists/dicts/whatever get very large there can be a noticeable delay when they need to be reallocated and resized.
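If so, preallocating the destination avoids resizing entirely - a sketch, using the dimensions from the question:

import numpy as np

# written in place, never resized: receivers x channels x samples
block = np.empty((4, 9, 4096), dtype=np.int16)
# inside the loop (hypothetical index names):
# block[receiver, channel, sample] = value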
Could it be that garbage collection is kicking in for the lists?
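That's easy to rule out - suspend collection around the loop and see whether the stalls disappear:

import gc

gc.disable()    # suspend automatic garbage collection
# ... run the block-reading loop and time it ...
gc.enable()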
Added: is it funny data, or blockno? What happens if you read the blocks in random order, along the lines of:
import random

r = list(range(4096))    # list() needed in Python 3, where range() is lazy
random.shuffle(r)        # shuffles in place
for blockno in r:
    file.seek(blockno * ...)    # seek to that block's offset
    ...