
python inserting large element in sqlite3 db - memory error

I am doing numerical simulations in Python that produce quite large data objects (approx. 200 MB), which I write into a sqlite3 database. After increasing the resolution (and thus the data size) by about 20%, I get a memory error when trying to insert into the database; it worked just fine with the smaller resolution before. Here is a code snippet:

import sqlite3 as sqlite
import pickle

def write_to_db(self, filename, dataObject, name):
    connection = sqlite.connect(filename)
    cursor = connection.cursor()
    # One row per pulse: an auto-assigned ID, a name, and the pickled data as a BLOB.
    cursor.execute("CREATE TABLE pulses (ID INTEGER PRIMARY KEY, name STRING, data BLOB)")
    cursor.execute("INSERT INTO pulses(name, data) VALUES (?, ?)",
                   (name, sqlite.Binary(pickle.dumps(dataObject))))
    connection.commit()
    connection.close()

I am working under Windows XP with 1 GB of RAM and 3 GB of swap (and did not receive a Windows notice that the swap file needs to be extended), using Python 2.6.

Thanks in advance for your help. Tim


Note that XP only increases the swap file in small increments. If your program suddenly tries to use a lot more memory and there is insufficient swap available, you will get a memory error.

SQLite will work quite happily with BLOBs up to 1 GB in size, and you can usually use up to 2 GB. However, you will run out of address space in a 32-bit process first.

For larger items it is generally recommended that you store the data in a file and store only the filename in the database, but that is more work for you.

You can solve your immediate problem by doing the following:

  • Switch to 64 bit. Microsoft sells a Windows 7 Family Pack that includes upgrades for 3 instances of XP/Vista for $150 (street price $130), so you can upgrade several machines. You can switch from 32-bit XP to 64-bit Win 7 this way. Just doing this will immediately solve your problem, even if you do not change the amount of RAM etc.

  • Pass -1 as the protocol argument to the pickle call, which tells it to use the most recent pickle protocol; that protocol uses a binary encoding rather than the default ASCII one, so you will get less data as a result. Read the docs for more info about protocol versions and which Python versions support them.

  • Also compress the pickled data, e.g. bz2.compress(pickle.dumps(obj, -1)), as shown in the sketch after this list.
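
A minimal sketch of those last two points, reusing the sqlite/pickle names from the question (the pack helper and the commented-out INSERT are illustrative only, not part of the original code):

import bz2
import pickle
import sqlite3 as sqlite

def pack(obj):
    # Protocol -1 selects the newest (binary) pickle protocol, which is far
    # more compact than the default ASCII protocol 0 on Python 2.x.
    pickled = pickle.dumps(obj, -1)
    # bz2 usually shrinks pickled numerical data considerably further.
    return sqlite.Binary(bz2.compress(pickled))

# Hypothetical use in the INSERT from the question:
# cursor.execute("INSERT INTO pulses(name, data) VALUES (?, ?)",
#                (name, pack(dataObject)))

To read the data back, reverse the steps with bz2.decompress() followed by pickle.loads().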

The most likely cause of the issues you have here is running out of address space. A 32-bit process will generally only be able to address 2 GB of data at the same time, and the various executables and shared libraries, the stacks for each thread, the SQLite cache etc. are also subtracted from that space. You will need to pay careful attention to all the various data items you have and when they are alive. Proactively calling del and gc.collect() when you are done with them will help reduce the amount of data in use concurrently.
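
For example, a toy sketch of releasing a large object as soon as it has been persisted (the bytearray is just a stand-in for the real simulation result):

import gc

data = bytearray(200 * 1024 * 1024)   # stand-in for a ~200 MB simulation result
# ... write `data` to the database here ...
del data                              # drop the reference as soon as it is no longer needed
gc.collect()                          # encourage CPython to release the memory promptly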


You ask in a comment about possible alternative storage mechanisms. If the data you're storing is a fifth of the available RAM, I'd say that you need to manage memory very carefully, since the data can't be copied many times before you run out of RAM; and in any case all that copying will kill performance. You may be able to get sqlite to do the right thing for you with BLOBs, since they are expected to potentially be large.

It looks like you're using sqlite as a simple key/value store. Have you considered using flat files instead? If you need atomicity, you could still use an sqlite database to define which files in your flat-file store are valid. You can do this safely by committing to the DB only after writing each flat file cleanly, and removing a flat file only after committing the corresponding deletion in the DB; a sketch of this pattern follows.
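
A minimal sketch, assuming a pulses table with name and path columns and a directory for the flat files (all names here are invented for illustration):

import os
import sqlite3
import uuid

def store_pulse(db_path, store_dir, name, payload):
    # Write the flat file first, so the database never points at a partial file.
    path = os.path.join(store_dir, uuid.uuid4().hex + ".dat")
    with open(path, "wb") as f:
        f.write(payload)
    # Only record the file once it is safely on disk.
    connection = sqlite3.connect(db_path)
    try:
        connection.execute("INSERT INTO pulses(name, path) VALUES (?, ?)", (name, path))
        connection.commit()
    finally:
        connection.close()

Deletion goes the other way round: delete the row and commit first, then remove the file, so a crash can only ever leave an orphaned file rather than a dangling database entry.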

To make this work, you'll need some mechanism to serialise your dataObject to a file-like object in Python. Pickle can do this for you if you pass it a file-like object, but I suspect that it will still be quite inefficient.
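
For example, pickle can stream straight to an open file instead of building the whole byte string in memory first (the object and filename below are placeholders):

import pickle

data_object = {"trace": range(10)}   # stand-in for the real simulation result

with open("pulse_0001.pkl", "wb") as f:
    pickle.dump(data_object, f, -1)  # protocol -1: newest binary protocol, written directly to disk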

You say you're doing numerical simulations; are you aware of numpy? numpy arrays have a tofile method that is more efficient than pickling, and you might get a significant performance boost in your simulations if you're not using numpy already.
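
For instance (shape, dtype and filename are made up; tofile writes only the raw bytes, so you have to remember the dtype and shape yourself when reading back):

import numpy as np

data = np.random.rand(1000, 1000)   # stand-in for a simulation result
data.tofile("pulse_0001.raw")       # raw binary dump, no pickling overhead

restored = np.fromfile("pulse_0001.raw", dtype=np.float64).reshape(1000, 1000)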
