开发者

Python: quickly loading 7 GB of text files into unicode strings

I have a large directory of text files--approximately 7 GB. I need to load them quickly into Python unicode strings in iPython. I have 15 GB of memory total. (I'm using EC2, so I can buy more memory if absolutely necessary.)

Simply reading the files will be too slow for my purposes. I have tried copying the files to a ramdisk and then loading them from there into iPython. That speeds things up but iPython crashes (not enough memory left over?) Here is the ramdisk setup:

mount -t tmpfs none /var/ramdisk -o size=7g开发者_高级运维

Anyone have any ideas? Basically, I'm looking for persistent in-memory Python objects. The iPython requirement precludes using IncPy: http://www.stanford.edu/~pgbovine/incpy.html .

Thanks!


There is much that is confusing here, which makes it more difficult to answer this question:

  • The ipython requirement. Why do you need to process such large data files from within ipython instead of a stand-alone script?
  • The tmpfs RAM disk. I read your question as implying that you read all of your input data into memory at once in Python. If that is the case, then python allocates its own buffers to hold all the data anyway, and the tmpfs filesystem only buys you a performance gain if you reload the data from the RAM disk many, many times.
  • Mentioning IncPy. If your performance issues are something you could solve with memoization, why can't you just manually implement memoization for the functions where it would help most?

So. If you actually need all the data in memory at once -- if your algorithm reprocesses the entire dataset multiple times, for example -- I would suggest looking at the mmap module. That will provide the data in raw bytes instead of unicode objects, which might entail a little more work in your algorithm (operating on the encoded data, for example), but will use a reasonable amount of memory. Reading the data into Python unicode objects all at once will require either 2x or 4x as much RAM as it occupies on disk (assuming the data is UTF-8).

If your algorithm simply does a single linear pass over the data (as does the Aho-Corasick algorithm you mention), then you'd be far better off just reading in a reasonably sized chunk at a time:

with codecs.open(inpath, encoding='utf-8') as f:
    data = f.read(8192)
    while data:
        process(data)
        data = f.read(8192)

I hope this at least gets you closer.


I saw the mention of IncPy and IPython in your question, so let me plug a project of mine that goes a bit in the direction of IncPy, but works with IPython and is well-suited to large data: http://packages.python.org/joblib/

If you are storing your data in numpy arrays (strings can be stored in numpy arrays), joblib can use memmap for intermediate results and be efficient for IO.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜