
How can I store and query 50 million simple Python dicts?

Question

Each dict has a unique numeric id field but the rest are variable (all text or numeric values though). Some simple query functionality would be nice e.g. get where name contains 'abc' or where a < 123.

What are my options?

Setup

  • Python 2.6
  • 32-bit Ubuntu-server 256MB
  • single machine
  • no swap

example dict:

{'id': 3823942, 'name': u'example', 'unknown_variable_key': u'foo'}

Notes

Can't use MongoDB: 32-bit MongoDB processes are limited to about 2.5 GB of data.

Can't use Redis because it needs the whole dataset in RAM.


You can try using CouchDB. It's a document-oriented database. There are examples (e.g. here) of it storing millions of documents and more.

There is a good guide to CouchDB.

Your numeric id field can be used as a document id.

You can try writing a Python script that fills the database using the bulk API (to make inserting that many objects faster).
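For example, with the couchdb-python package a bulk load could look roughly like this (just a sketch: the database name mydocs and the chunk size are arbitrary, and it assumes a local CouchDB on the default port):

import couchdb

server = couchdb.Server('http://localhost:5984/')
db = server.create('mydocs')              # hypothetical database name

def bulk_load(dicts, chunk_size=50000):
    # one _bulk_docs request per chunk instead of one request per document
    chunk = []
    for d in dicts:
        d['_id'] = '%09d' % d.pop('id')   # reuse the numeric id as the document id
        chunk.append(d)
        if len(chunk) >= chunk_size:
            db.update(chunk)
            chunk = []
    if chunk:
        db.update(chunk)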

Views can be used to make various queries.

[edit]

I was curious about CouchDB memory consumption and made a simple experiment.

I inserted 5 million documents in chunks of 50k docs into an empty database. It took about 15-20 minutes on my Core Duo MacBook. During the insertion, maximum memory consumption by the CouchDB process was 120 MB when inserting in 50k-doc chunks and around 60 MB when inserting in 25k-doc chunks.

The final database size is around 1.2 GB. Memory consumption of the idle CouchDB process is 40 MB. Fetching single documents is instantaneous (using the Futon web frontend).

I was inserting documents of the following format (JSON generated with Python):

'{"_id" : "%09d", "name" : "example", "field%d" : "hello"}' % 
    (num, random.randint(0, 100))

or

{"_id" : "005049980", "name" : "example", "field77" : "hello"}

The _id is used as a key and documents are sorted by that key.

Regarding the sample queries you've provided: they are not efficient, because in general you would need to traverse the whole document set to get the result. But both can be optimized by creating dedicated views.
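For example, with couchdb-python a permanent view for the "a < 123" style query could be saved and queried roughly like this (a sketch: the design document name, the view name and the field name a are my own assumptions):

import couchdb

db = couchdb.Server('http://localhost:5984/')['mydocs']   # hypothetical database name

# the map function emits the numeric field as the key, so it can be range-queried
design = {
    '_id': '_design/queries',
    'language': 'javascript',
    'views': {
        'by_a': {
            'map': "function(doc) { if (doc.a !== undefined) emit(doc.a, null); }"
        }
    }
}
db.save(design)

# all documents with a < 123 (endkey is inclusive, so 122 for integer values)
for row in db.view('queries/by_a', endkey=122, include_docs=True):
    print row.doc

The "name contains 'abc'" case is not a plain key range, so it would need a different mechanism, e.g. an external full-text index.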


PyTables might be an option for you: http://www.pytables.org/moin
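Here is a sketch with the PyTables 2.x API that was current for Python 2.6. Note that PyTables wants a fixed column layout, so the variable dict keys would have to be mapped onto a known set of columns; the columns below, including the numeric field a, are only illustrative:

import tables

class Record(tables.IsDescription):
    id = tables.Int64Col()
    name = tables.StringCol(64)
    a = tables.Int64Col()                 # hypothetical numeric field from the question

h5file = tables.openFile('data.h5', mode='w')
table = h5file.createTable('/', 'records', Record)

dicts = [{'id': 3823942, 'name': u'example', 'a': 42}]   # stand-in for the real data
row = table.row
for d in dicts:
    row['id'] = d['id']
    row['name'] = d.get('name', u'').encode('utf-8')
    row['a'] = d.get('a', 0)
    row.append()
table.flush()

# in-kernel query, evaluated chunk-wise on disk rather than loading everything into RAM
hits = [r['id'] for r in table.where('a < 123')]
h5file.close()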


Since the dataset is too large to fit entirely in memory, you're most likely limited to solutions that use the disk. The simplest way to store it is to serialize each dict and save it as a file, and there are some simple optimizations you can do for indexing so you don't have to scan the entire dataset on every query. If you have access to a third-party database (MySQL or even sqlite3), you can store the dictionaries in a table with columns id, key, value (so each dict maps to multiple rows), and create indexes on id and key for queries.
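For the sqlite3 variant, the id/key/value layout could look roughly like this (a sketch: table, index and file names are arbitrary, and since values are stored as text, a numeric comparison such as a < 123 would need a CAST or a separate numeric column):

import sqlite3

conn = sqlite3.connect('dicts.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (id INTEGER, key TEXT, value TEXT)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_kv_id ON kv (id)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_kv_key ON kv (key, value)')

def store(d):
    # each dict becomes one row per non-id key
    rows = [(d['id'], k, unicode(v)) for k, v in d.items() if k != 'id']
    conn.executemany('INSERT INTO kv (id, key, value) VALUES (?, ?, ?)', rows)

store({'id': 3823942, 'name': u'example', 'unknown_variable_key': u'foo'})
conn.commit()

# "name contains 'abc'"
for row in conn.execute(
        "SELECT DISTINCT id FROM kv WHERE key = 'name' AND value LIKE '%abc%'"):
    print row[0]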


Try using littletable. It works with a list of objects, infers column names from object attributes, and supports simple query, join, and pivot table operations. One of the nice things is that the results of query and join operations are new, full-fledged, queryable littletables.

littletable hasn't gotten a lot of attention or use yet, so I'd be very curious to hear your feedback/experience with it.
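A minimal sketch, assuming littletable's Table/DataObject/create_index/where interface (the table name and the sample lookup are just illustrative):

import littletable as lt

records = lt.Table('records')
records.create_index('id', unique=True)

records.insert(lt.DataObject(id=3823942, name=u'example',
                             unknown_variable_key=u'foo'))

print records.by.id[3823942].name         # lookup through the unique index
matches = records.where(name=u'example')  # the result is itself a queryable table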


Have you tried OrientDB? It's a fast document-graph NoSQL engine that supports SQL and transactions. There are no space limits on 32-bit systems, even though it's faster on 64-bit.

I've just tried inserting 1 million indexed JSON documents into a fresh OrientDB database. The records are very simple:

{ 'id' : 0, 'name' : 'Gipsy', 'type' : 'Cat', 'race' : 'European', 'country' : 'Italy', 'price' : 300.00 }

But with id and price incremented by 1 on every cycle. It took 21 seconds on my 3-year-old notebook, with the heap limited to 256 MB.

The query:

SELECT from TEST where id = 30000

Took 0.01 seconds!

The code used for insertion has been taken from this example: http://code.google.com/p/orient/source/browse/trunk/tests/src/test/java/com/orientechnologies/orient/test/database/speed/DictionaryPutSpeedTest.java

