"Smart" / Economical Data Storage Techniques?

2023-02-18 07:22 问答作者：

I would like to store millions of data lines that looks like this:

key, value

key is an integer in the range of (0 to 5,000,000); all values are unique;

value is an unsigned int16 value (0 to 65535)

the key is to store the data whil开发者_如何学编程e taking the LEAST AMOUNT OF DISK SPACE, and yet, be able to query the data. can you think of any algorithms / smart schemes for data storage that would be helpful?

just in case it matters, I use Linux.

One option would be, if the key values are not important data but rather just index data to utilize a flat file of bits ( with a descriptive header ). Every 16 bits is a value and the nth value would then be (n - 1) * 16 bits from the end of the header.

Additionally, if the key value does matter, a set flat file of about 10MB would allow for the entire key space to be stored without storing actual keys. The 16 bits that are at the (n - 1) * 16 offset would be that key's value.

That would probably be the least space intensive method for storage, as it would be only the data that is literally required. ( Though, if you are only interested in say 100k values and one has a key of 5 million you do end up with a lot of wasted space, which wouldn't be there with an actual key,value addressing system. So this methodology only achieves a minimum disk storage for sets of tightly grouped values or many many numbers (over about the 2 million mark ).

how do you plan to use stored data? with random or sequential access? for sequential access you can use any archiving algorithm, e.g. LZMA. Random access doesn't leave you a lot of space for improvements.

can you see any patterns of this data? e.g. if the difference between adjacent keys/values are often small you can store only packed differences. and million of other possible approaches.

[EDIT] also you can check techniques used for data compression in network communication
[EDIT1] and you can check this Google Code Integer Array Compression project

This depend upon the operation and data. I would first recommend "just using a database" (a simple key-value store such as BDB/EhCache [read: Key Value store], for instance :-)

Mimisbrunnr also has a good answer if all the keys are used.

If the keys are near constant/read-only and only a relatively small percent of the keys are used, consider the use of a (disk-based) Heap data-structure (very similar to an Array-based Heap; Heaps need not be Array-based). Robert Sedgewick had a good book from the late 80's that had a very lean implementation, but I forget the name. A Heap will be more beneficial when compared to a flat index with a smaller proportion of used keys and at full-load will have worse storage requirements.

(If abstracted, the used method could be switched and/or a hybrid heap with indexed/sequenced leaf-node values could be used [along with Huffman encoding or whatnot], but that is just adding far more complications. Keep it simple ... hence first suggestion of an existing key/value store ;-)

Happy coding.

Have you considered using a database designed for mobile devices such as SQL Server Compact, or another similar database? These will have a small footprint on the disk, while still providing the full search power you need.

Another example of a compact database engine is KeyDB for linux:

http://3d2f.com/programs/11-989-keydb-download.shtml

继续阅读：storage

"Smart" / Economical Data Storage Techniques?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？