
How can I store and query 50 million simple Python dicts?

Question

Each dict has a unique numeric id field but the rest are variable (all text or numeric values though). Some simple query functionality would be nice e.g. get where name contains 'abc' or where a < 123.

What are my options?

Setup

  • Python 2.6
  • 32-bit Ubuntu-server 256MB
  • single machine
  • no swap

example dict:

{'id': 3823942, 'name': u'example', 'unknown_variable_key': u'foo'}

Notes

Can't use MongoDB: 32-bit MongoDB processes are limited to about 2.5 GB of data.

Can't use Redis because it needs the whole dataset in RAM.


You can try using CouchDB. It's a document-oriented database. There are examples (e.g. here) of it storing millions of documents and more.

There is a good guide to CouchDB.

Your numeric id field can be used as a document id.

You can try writing a Python script that fills the database using the bulk API (to make inserting that many objects faster).
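For example, with the couchdb-python package a bulk load could look roughly like this (just a sketch: the database name mydocs and the chunk size are arbitrary, and it assumes a local CouchDB on the default port):

import couchdb

server = couchdb.Server('http://localhost:5984/')
db = server.create('mydocs')              # hypothetical database name

def bulk_load(dicts, chunk_size=50000):
    # one _bulk_docs request per chunk instead of one request per document
    chunk = []
    for d in dicts:
        d['_id'] = '%09d' % d.pop('id')   # reuse the numeric id as the document id
        chunk.append(d)
        if len(chunk) >= chunk_size:
            db.update(chunk)
            chunk = []
    if chunk:
        db.update(chunk)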

Views can be used to make various queries.

[edit]

I was curious about CouchDB memory consumption and made a simple experiment.

I inserted 5 million documents in chunks of 50k docs into an empty database. It took about 15-20 minutes on my Core Duo MacBook. During the insertion, maximum memory consumption by the CouchDB process was 120 MB when inserting in 50k-doc chunks and around 60 MB when inserting in 25k-doc chunks.

The final database size is around 1.2 GB. Memory consumption of the idle CouchDB process is 40 MB. Fetching single documents is instantaneous (using the Futon web frontend).

I was inserting documents of the following format (JSON generated with Python):

'{"_id" : "%09d", "name" : "example", "field%d" : "hello"}' % 
    (num, random.randint(0, 100))

or

{"_id" : "005049980", "name" : "example", "field77" : "hello"}

The _id is used as a key and documents are sorted by that key.

Regarding the sample queries you've provided: they are not efficient, because in general you would need to traverse the whole document set to get the result. But both can be optimized by creating dedicated views.
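For example, with couchdb-python a permanent view for the "a < 123" style query could be saved and queried roughly like this (a sketch: the design document name, the view name and the field name a are my own assumptions):

import couchdb

db = couchdb.Server('http://localhost:5984/')['mydocs']   # hypothetical database name

# the map function emits the numeric field as the key, so it can be range-queried
design = {
    '_id': '_design/queries',
    'language': 'javascript',
    'views': {
        'by_a': {
            'map': "function(doc) { if (doc.a !== undefined) emit(doc.a, null); }"
        }
    }
}
db.save(design)

# all documents with a < 123 (endkey is inclusive, so 122 for integer values)
for row in db.view('queries/by_a', endkey=122, include_docs=True):
    print row.doc

The "name contains 'abc'" case is not a plain key range, so it would need a different mechanism, e.g. an external full-text index.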


PyTables might be an option for you: http://www.pytables.org/moin
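Here is a sketch with the PyTables 2.x API that was current for Python 2.6. Note that PyTables wants a fixed column layout, so the variable dict keys would have to be mapped onto a known set of columns; the columns below, including the numeric field a, are only illustrative:

import tables

class Record(tables.IsDescription):
    id = tables.Int64Col()
    name = tables.StringCol(64)
    a = tables.Int64Col()                 # hypothetical numeric field from the question

h5file = tables.openFile('data.h5', mode='w')
table = h5file.createTable('/', 'records', Record)

dicts = [{'id': 3823942, 'name': u'example', 'a': 42}]   # stand-in for the real data
row = table.row
for d in dicts:
    row['id'] = d['id']
    row['name'] = d.get('name', u'').encode('utf-8')
    row['a'] = d.get('a', 0)
    row.append()
table.flush()

# in-kernel query, evaluated chunk-wise on disk rather than loading everything into RAM
hits = [r['id'] for r in table.where('a < 123')]
h5file.close()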


Since the dataset is too large to fit entirely in memory, you're most likely limited to solutions that use the disk. The simplest way to store it is to serialize each dict and save it as a file, and there are some simple optimizations you can do for indexing so you don't have to scan the entire dataset on every query. If you have access to a third-party database (MySQL or even sqlite3), you can store the dictionaries in a table with columns id, key, value (so each dict maps to multiple rows), and create indexes on id and key for queries.
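For the sqlite3 variant, the id/key/value layout could look roughly like this (a sketch: table, index and file names are arbitrary, and since values are stored as text, a numeric comparison such as a < 123 would need a CAST or a separate numeric column):

import sqlite3

conn = sqlite3.connect('dicts.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (id INTEGER, key TEXT, value TEXT)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_kv_id ON kv (id)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_kv_key ON kv (key, value)')

def store(d):
    # each dict becomes one row per non-id key
    rows = [(d['id'], k, unicode(v)) for k, v in d.items() if k != 'id']
    conn.executemany('INSERT INTO kv (id, key, value) VALUES (?, ?, ?)', rows)

store({'id': 3823942, 'name': u'example', 'unknown_variable_key': u'foo'})
conn.commit()

# "name contains 'abc'"
for row in conn.execute(
        "SELECT DISTINCT id FROM kv WHERE key = 'name' AND value LIKE '%abc%'"):
    print row[0]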


Try using littletable. It works with a list of objects, infers column names from object attributes, and supports simple query, join, and pivot table operations. One of the nice things is that the results of query and join operations are new, full-fledged, queryable littletables.

littletable hasn't gotten a lot of attention or use yet, so I'd be very curious to hear your feedback/experience with it.
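A minimal sketch, assuming littletable's Table/DataObject/create_index/where interface (the table name and the sample lookup are just illustrative):

import littletable as lt

records = lt.Table('records')
records.create_index('id', unique=True)

records.insert(lt.DataObject(id=3823942, name=u'example',
                             unknown_variable_key=u'foo'))

print records.by.id[3823942].name         # lookup through the unique index
matches = records.where(name=u'example')  # the result is itself a queryable table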


Have you tried OrientDB? It's a fast document-graph NoSQL engine that supports SQL and transactions. There are no space limits on 32-bit systems, even though it's faster on 64-bit.

I've just tried inserting 1 million indexed JSON documents into a fresh OrientDB database. The records are very simple:

{ 'id' : 0, 'name' : 'Gipsy', 'type' : 'Cat', 'race' : 'European', 'country' : 'Italy', 'price' : 300.00 }

But with id and price incremented by 1 on every cycle. It took 21 seconds on my 3-year-old notebook, with the heap limited to 256 MB.

The query:

SELECT from TEST where id = 30000

Took 0.01 seconds!

The code used for insertion has been taken from this example: http://code.google.com/p/orient/source/browse/trunk/tests/src/test/java/com/orientechnologies/orient/test/database/speed/DictionaryPutSpeedTest.java

