开发者

"Smart" / Economical Data Storage Techniques?

I would like to store millions of data lines that looks like this:

key, value

key is an integer in the range of (0 to 5,000,000); all values are unique;

value is an unsigned int16 value (0 to 65535)

the key is to store the data whil开发者_如何学编程e taking the LEAST AMOUNT OF DISK SPACE, and yet, be able to query the data. can you think of any algorithms / smart schemes for data storage that would be helpful?

just in case it matters, I use Linux.


One option would be, if the key values are not important data but rather just index data to utilize a flat file of bits ( with a descriptive header ). Every 16 bits is a value and the nth value would then be (n - 1) * 16 bits from the end of the header.

Additionally, if the key value does matter, a set flat file of about 10MB would allow for the entire key space to be stored without storing actual keys. The 16 bits that are at the (n - 1) * 16 offset would be that key's value.

That would probably be the least space intensive method for storage, as it would be only the data that is literally required. ( Though, if you are only interested in say 100k values and one has a key of 5 million you do end up with a lot of wasted space, which wouldn't be there with an actual key,value addressing system. So this methodology only achieves a minimum disk storage for sets of tightly grouped values or many many numbers (over about the 2 million mark ).


how do you plan to use stored data? with random or sequential access? for sequential access you can use any archiving algorithm, e.g. LZMA. Random access doesn't leave you a lot of space for improvements.

can you see any patterns of this data? e.g. if the difference between adjacent keys/values are often small you can store only packed differences. and million of other possible approaches.

[EDIT] also you can check techniques used for data compression in network communication
[EDIT1] and you can check this Google Code Integer Array Compression project


This depend upon the operation and data. I would first recommend "just using a database" (a simple key-value store such as BDB/EhCache [read: Key Value store], for instance :-)

Mimisbrunnr also has a good answer if all the keys are used.

If the keys are near constant/read-only and only a relatively small percent of the keys are used, consider the use of a (disk-based) Heap data-structure (very similar to an Array-based Heap; Heaps need not be Array-based). Robert Sedgewick had a good book from the late 80's that had a very lean implementation, but I forget the name. A Heap will be more beneficial when compared to a flat index with a smaller proportion of used keys and at full-load will have worse storage requirements.

(If abstracted, the used method could be switched and/or a hybrid heap with indexed/sequenced leaf-node values could be used [along with Huffman encoding or whatnot], but that is just adding far more complications. Keep it simple ... hence first suggestion of an existing key/value store ;-)

Happy coding.


Have you considered using a database designed for mobile devices such as SQL Server Compact, or another similar database? These will have a small footprint on the disk, while still providing the full search power you need.

Another example of a compact database engine is KeyDB for linux:

http://3d2f.com/programs/11-989-keydb-download.shtml

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜