key-value store suggestion
I need a very 开发者_运维问答basic key-value store for java. I started with a HashMap but it seems that HashMap is somewhat space inefficient (I'm storing ~20 million records, and seems to require ~6GB RAM).
The map is Map<Integer,String>
, and so I'm considering using GNU Trove TIntObjectHashMap<byte[]>
, and storing the map value as an ascii byte array rather than String.
As an alternative to that, is there a key-value store that only requires adding jar files, does not hold the entire map in RAM at once, and is still reasonably fast?
BabuDB
BabuDB is an embedded non-relational database system. Its lean and simple design allows it to persistently store large amounts of key-value pairs without the overhead and complexity of similar approaches such as BerkeleyDB.
License: New BSD license, Language: Java
JDBM2
JDBM2 provides HashMap and TreeMap which are backed by disk storage.
License: Apache License 2.0, Language: Java
Banana DB
Banana DB is a self-contained key/value pair database implemented in Java.
License: Apache License 2.0, Language: Java
I've tried BabuDB and JDBM2 and they work fine. BabuDB is a little bit more difficult to set up, but potentially delivers higher performance than JDBM2.
These all all databases, which allow to persist data on disk. There are also solutions to hold a large map in memory (ehcache, hazelcast, ...).
Use Berkeley DB.
Berkeley DB stores object graphs, objects in collections, or simple binary key/value data directly in an a btree on disk. This simple, highly efficient approach removes all the unnecessary overhead in ORM solutions. Using the Direct Persistence Layer (DPL) Java developers annotate classes with storage information, much like JPA. This approach is familiar, efficient, and fast. The DPL reduces the complexity of data storage while not sacrificing speed.
This should definitely give you huge gains in memory and speed, while not increasing the complexity of your application. Enjoy!
http://www.mapdb.org/ is what you are looking for. It's a rocking fast persistent implementation of java.util.Map.
Features
Concurrent
MapDB has record level locking and state-of-art concurrent engine. Its performance scales nearly linearly with number of cores. Data can be written by multiple parallel threads.
Fast
MapDB has outstanding performance rivaled only by native DBs. It is result of more than a decade of optimizations and rewrites.
ACID
MapDB optionally supports ACID transactions with full MVCC isolation. MapDB uses write-ahead-log or append-only store for great write durability.
Flexible
MapDB can be used everywhere from in-memory cache to multi-terabyte database. It also has number of options to trade durability for write performance. This makes it very easy to configure MapDB to exactly fit your needs.
Hackable
MapDB is component based, most features (instance cache, async writes, compression) are just class wrappers. It is very easy to introduce new functionality or component into MapDB.
SQL Like
MapDB was developed as faster alternative to SQL engine. It has number of features which makes transition from relational database easier: secondary indexes/collections, autoincremental sequential ID, joins, triggers, composite keys…
Low disk-space usage
MapDB has number of features (serialization, delta key packing…) to minimize disk used by its store. It also has very fast compression and custom serializers. We take disk-usage seriously and do not waste single byte.
Consider Koloboke Collections, which is up to 2 times faster than Trove according to various tests:
- Time - memory tradeoff with the example of Java Maps
- Large HashMap overview: JDK, FastUtil, Goldman Sachs, HPPC, Koloboke, Trove
if configured to consume the same memory as Trove. Or alternatively, you can think it consumes considerably lesser memory if configured to be equally fast to Trove.
If you want to persist the map between JVM runs with very quick bootstrap, you might also be interested in Chronicle-Map which stores String
s in UTF-8 by default (so you shouldn't bother with conversions String
<-> byte[]
as with Koloboke/Trove). Chronicle-Map is ultra fast for persisted key-value store, but a bit slower that Koloboke and even Trove.
Just wanted to reference some more open source options that became available over time since this question was first asked.
Apache 2, BTree, Apache Directory Project JDBM replacement effort:
http://directory.apache.org/mavibot/
MPL2/EPL1, RTree, MVStore, H2 Storage Engine:
http://www.h2database.com/html/mvstore.html
Apache 2, Xodus Environments, JetBrains YouTrack and Hub storage engine:
https://github.com/JetBrains/xodus
The map is Map, and so I'm considering using GNU Trove TIntObjectHashMap, and storing the map value as an ascii byte array rather than String.
This doesn't entirely make sense because a TIntObjectHashMap
is not a Map
. However, the approach is sound.
Do you know what kind of space savings I can expect over HashMap for Trove?
The best answer is to try it out.
But here are some rough estimates (assuming a 32bit JVM):
HashMap keys would need to be Integer instances. They will occupy ~18bytes per instance + 4 bytes per reference. Total 24 bytes.
Trove keys would be 4 byte
int
values.String values would be 20 bytes + 12 bytes + 2 * number of "characters".
Byte array values would be 12 bytes + 1 * number of "characters".
I haven't examined the details of the respective hash table internal data structures.
That probably amounts to around 50% memory saving, though it depends critically on the average length of the value "strings". (The longer they are, the more they will dominate the space usage.)
FWIW, Trove publish their own benchmarks here. They don't look very convincing, but you should be able to dig out their benchmark code and modify it to better match your use-case.
You can use Xodus KV. That is a key-value store used in production by JetBrains in the YouTrack product. It provides snapshot isolation with readers not competing with writers. JetBrains actively supports it.
Xodus also has an entity store solution which, along with ORM implemented in Kotlin, can be used as a primary database in your project.
There are plans to implement SQL language, which will allow using Xodus as the primary database for projects written in other languages.
精彩评论