
Efficient (both time and space) dictionary database (unique word to unique id and back)

I'm looking for a solution which is capable of:

  • storing arbitrary sized unique words, along with their unique 64 bit unsigned integer identifier and a 32 or 64 bit unsigned int reference count
  • accessing the data quickly with these patterns:
    • for a lookup of a word, give back its uint64 identifier
    • for a lookup of an identifier, give back the word
  • inserting new records, preferably with an auto-incremented identifier and an atomically incremented reference count, and preferably in batch commits (meaning not word by word, each in a separate transaction, but several words in one committed transaction)
  • atomically deleting records which have a zero reference count (this could be done even with a rate-limited full table scan, iterating through all the records and deleting the ones with a 0 refcount in a transaction)
  • storing a large number of records on traditional spinning rust (hard disks); the record number is somewhere between 100 million and 1000 billion (1000*10^9)
  • the average word size is somewhere between 25 and 80 bytes
  • it would be good to have Python (for prototyping) and C interfaces, preferably embeddable, or an efficient "remote" API (it will be on localhost only)

For example a MySQL schema would be something like this:

CREATE TABLE words (
    id SERIAL,
    word MEDIUMTEXT,
    refcnt INT UNSIGNED,
    INDEX(word(12)),
    PRIMARY KEY (id)
)

This of course works, but MySQL isn't up to this task, and due to the index needed for word searches, it stores redundant information needlessly.

During the search for the most efficient solution, I have figured out the following so far:

  • because the words share a lot of commonality (most of them are plain dictionary words in various languages and character sets), something like this would be a good fit: http://www.unixuser.org/~euske/doc/tcdb/index.html
  • the best I could find so far is Tokyo Cabinet's TDB: packages.python.org/tokyocabinet-python/TDB.html, but I still have to evaluate its performance and the possible setups (where to store what, and what kind of index to use where, for the best time and space efficiency)

Any ideas, algorithms, or even better, ready-to-use products and setups?

Thanks,


You might want to try Redis. It covers most if not all of your requirements. It has good performance for your use case, has atomic increment/decrement for creating reference counts and unique identifiers, clients exist for many languages including Python and C, and you can wrap sequences of commands in a transaction. It also supports lists, sets, sorted sets and several other features you might find useful.

If you can partition your work you can load / process the data from multiple hosts in parallel. Given the speed of redis you might not need to batch things but this is possible (MSET command).
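
A minimal Python sketch of such a batch write, assuming the redis-py client and the key layout I use below (the ids here are made up for the example):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# MSET writes all mappings in one atomic round trip: the forward
# word -> id entries and the reverse id:<id> -> word entries together
batch = {"chaos": 123456, "id:123456": "chaos",
         "order": 123457, "id:123457": "order"}
r.mset(batch)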

Another nice aspect is that you can interact with Redis from the command line using the redis-cli command. This way you can try out / debug sequences of commands before you attempt to write any code. Assuming Redis is running on localhost with the default port, type this:

% redis-cli
redis>

I wrote up a quick set of commands which support your use case.

This snippet creates an integer key named next.words.id and increments it atomically, returning the new value. I started the sequence at 123455 for the sake of illustration. The (integer) 123456 is the value which would be returned to your client:

redis> SET next.words.id 123455
OK
redis> INCR next.words.id
(integer) 123456
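
From Python this is a single call; a sketch assuming the redis-py client and the same key name:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
new_id = r.incr("next.words.id")   # atomic increment, returns the new id as an int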

We then map the word to its id ("chaos" -> 123456), then create a reverse mapping (id:123456 -> "chaos"), and finally create a reference count key set to 0. The prefixes id: and ref: and the key next.words.id are just conventions I chose -- you can use any naming you like.

redis> SET chaos 123456
OK
redis> SET id:123456 chaos
OK
redis> SET ref:chaos 0
OK
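
In Python the three SETs can be wrapped in one MULTI/EXEC transaction via a pipeline. A sketch, assuming redis-py and the key conventions above:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

word = "chaos"
word_id = r.incr("next.words.id")     # e.g. 123456
pipe = r.pipeline()                   # transactional (MULTI/EXEC) by default
pipe.set(word, word_id)               # word -> id
pipe.set("id:%d" % word_id, word)     # id -> word
pipe.set("ref:" + word, 0)            # reference count starts at 0
pipe.execute()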

To increment the reference count for the word "chaos":

redis> INCR ref:chaos
(integer) 1
redis> INCR ref:chaos
(integer) 2

To decrement the reference count, use DECR:

redis> DECR ref:chaos
(integer) 1
redis> DECR ref:chaos
(integer) 0
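
The same counters from Python; INCR and DECR map directly onto redis-py's incr()/decr():

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
r.incr("ref:chaos")   # -> 1
r.incr("ref:chaos")   # -> 2
r.decr("ref:chaos")   # -> 1
r.decr("ref:chaos")   # -> 0, the word is now unreferenced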

At this point your code could detect that the refcount for "chaos" has gone to 0 and execute the following commands in a single transaction, removing the word and its id: and ref: keys. I used the WATCH command to avoid a race condition: if any other client changes the ref:chaos key before our transaction commits, it will be aborted.

redis> WATCH ref:chaos
OK
redis> GET chaos
"123456"
redis> MULTI
OK
redis> DEL chaos
QUEUED
redis> DEL id:123456
QUEUED
redis> DEL ref:chaos
QUEUED
redis> EXEC
1) (integer) 1
2) (integer) 1
3) (integer) 1
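
Here is the equivalent check-and-delete from Python, as a sketch: redis-py exposes WATCH/MULTI/EXEC through its pipeline object and raises WatchError if the watched key changes before EXEC. The key names are the conventions from above, the word is assumed to exist, and retry logic is left out:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def delete_if_unreferenced(r, word):
    with r.pipeline() as pipe:
        try:
            pipe.watch("ref:" + word)            # WATCH ref:<word>
            if int(pipe.get("ref:" + word)) != 0:
                pipe.unwatch()
                return False                     # still referenced, keep it
            word_id = pipe.get(word)             # read the id while watched
            pipe.multi()                         # start the MULTI block
            pipe.delete(word)
            pipe.delete("id:" + word_id)
            pipe.delete("ref:" + word)
            pipe.execute()                       # EXEC; all three DELs commit atomically
            return True
        except redis.WatchError:
            # another client touched ref:<word> before EXEC; Redis aborted the transaction
            return False

delete_if_unreferenced(r, "chaos")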

Hope this helps.
