开发者

Alternatives to keeping large lists in memory (python)

If I have a list(or array, dictionary....) in python that could exceed the available memory address space, (32 bit python) what are the options and there relative speeds? (other than not making a list that large) The list could exceed the memory but I have no way of knowing before hand. Once it starts exceeding 75% I would like to no longer keep the list in memory (or the new items anyway), is there开发者_开发知识库 a way to convert to a file based approach mid-stream?

What are the best (speed in and out) file storage options?

Just need to store a simple list of numbers. no need to random Nth element access, just append/pop type operations.


If your "numbers" are simple-enough ones (signed or unsigned integers of up to 4 bytes each, or floats of 4 or 8 bytes each), I recommend the standard library array module as the best way to keep a few millions of them in memory (the "tip" of your "virtual array") with a binary file (open for binary R/W) backing the rest of the structure on disk. array.array has very fast fromfile and tofile methods to facilitate the moving of data back and forth.

I.e., basically, assuming for example unsigned-long numbers, something like:

import os

# no more than 100 million items in memory at a time
MAXINMEM = int(1e8)

class bigarray(object):
  def __init__(self):
    self.f = open('afile.dat', 'w+')
    self.a = array.array('L')
  def append(self, n):
    self.a.append(n)
    if len(self.a) > MAXINMEM:
      self.a.tofile(self.f)
      del self.a[:]
  def pop(self):
    if not len(self.a):
      try: self.f.seek(-self.a.itemsize * MAXINMEM, os.SEEK_END)
      except IOError: return self.a.pop()  # ensure normal IndexError &c
      try: self.a.fromfile(self.f, MAXINMEM)
      except EOFError: pass
      self.f.seek(-self.a.itemsize * MAXINMEM, os.SEEK_END)
      self.f.truncate()
    return self.a.pop()

Of course you can add other methods as necessary (e.g. keep track of the overall length, add extend, whatever), but if pop and append are indeed all you need this should serve.


There are probably dozens of ways to store your list data in a file instead of in memory. How you choose to do it will depend entirely on what sort of operations you need to perform on the data. Do you need random access to the Nth element? Do you need to iterate over all elements? Will you be searching for elements that match certain criteria? What form do the list elements take? Will you only be inserting at the end of the list, or also in the middle? Is there metadata you can keep in memory with the bulk of the items on disk? And so on and so on.

One possibility is to structure your data relationally, and store it in a SQLite database.


The answer is very much "it depends".

What are you storing in the lists? Strings? integers? Objects?

How often is the list written to compared with being read? Are items only appended on the end, or can entries be modified or inserted in the middle?

If you are only appending to the end then writing to a flat file may be the simplest thing that could possibly work.

If you are storing objects of variable size such as strings then maybe keep an in-memory index of the start of each string, so you can read it quickly.

If you want dictionary behaviour then look at the db modules - dbm, gdbm, bsddb, etc.

If you want random access writing then maybe a SQL database may be better.

Whatever you do, going to disk is going to be orders of magnitude slower than in-memory, but without knowing how the data is going to be used it is impossible to be more specific.

edit: From your updated requirements I would go with a flat file and keep an in-memory buffer of the last N elements.


Well, if you are looking for speed and your data is numerical in nature, you could consider using numpy and PyTables or h5py. From what I remember, the interface is not as nice as simple lists, but the scalability is fantastic!


Did you check shelve python module which is based on pickle?

http://docs.python.org/library/shelve.html


You might want to consider a different kind of structure: not a list, but figuring out how to do (your task) with a generator or a custom iterator.


You can try blist: https://pypi.python.org/pypi/blist/

The blist is a drop-in replacement for the Python list the provides better performance when modifying large lists.


Modern operating systems will handle this for you without you having to worry about it. It's called virtual memory.


What about a document oriented database?
There are several alternatives; I think the most known one currently is CouchDB, but you can also go for Tokyo Cabinet, or MongoDB. The last one has the advantage of python bindings directly from the main project, without requiring any additional module.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜