Reading/Writing/Storing extremely large sets of sequential data
I am interacting with large sequential sets of data in Java. Ideally, I'm searching for a library where I can store streaming data (think sequences of immutable objects) and then jump around through the saved data later. The data should ultimately be stored on disk and shouldn't be held in memory in its entirety. The data would be states of mathematical systems -- so predominantly numbers (doubles, or even BigDecimals) as well as some strings.
At the moment this is for a desktop application, so there would only be one user and maybe a few concurrent connections at a time (several streams of objects/states). Later I may consider a distributed approach and support for multiple clients on the same database backend.
I've been looking at various NoSQL libraries but I am not sure what's right for my needs. Any thoughts?
Take a look at OrientDB: it's very fast for insertions. On my notebook it inserts 1,000,000 entries in 6 seconds. Furthermore, it's written in Java and can run embedded in your process.
If you have any means of calculating the offset for each object you want to access, a simple java.nio.MappedByteBuffer (the equivalent of mmap) might do the job.
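As a minimal sketch of that idea, assuming each state is a fixed number of doubles (the class name, file name, and record layout here are illustrative, not from the question):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedStates {
    // Each state is a fixed number of doubles, so its byte offset is computable.
    static final int DOUBLES_PER_STATE = 4;
    static final int BYTES_PER_STATE = DOUBLES_PER_STATE * Double.BYTES;

    public static void main(String[] args) throws IOException {
        Path file = Path.of("states.bin");
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map room for 1000 states; the OS pages data in and out on demand.
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, 1000L * BYTES_PER_STATE);

            // Write state #42 at its computed offset.
            map.position(42 * BYTES_PER_STATE);
            for (int i = 0; i < DOUBLES_PER_STATE; i++) {
                map.putDouble(i * 0.5);
            }

            // Jump straight back to it later.
            map.position(42 * BYTES_PER_STATE);
            System.out.println(map.getDouble()); // prints 0.0
        }
    }
}
```

Since the record size is fixed, `index * BYTES_PER_STATE` is the whole "offset calculation", and the OS handles paging and caching.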
If you have a 64-bit JVM you can memory-map the files into memory. This gives you a window of up to 2 GB into each file (a single MappedByteBuffer is indexed by int, so it tops out at 2 GB).
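For files larger than the 2 GB per-buffer limit, you can map a series of windows and route each absolute offset to the right one. A rough sketch (window size and class name are my own choices, and a real version should cache the mapped windows instead of re-mapping on every read):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Views a file larger than 2 GB as a series of memory-mapped windows. */
public class WindowedFile {
    // A single MappedByteBuffer is indexed by int, so keep each window below 2 GB.
    static final long WINDOW_SIZE = 1L << 30; // 1 GB windows (arbitrary choice)

    private final FileChannel channel;

    public WindowedFile(Path path) throws IOException {
        this.channel = FileChannel.open(path, StandardOpenOption.READ);
    }

    /** Reads one double at an absolute byte offset anywhere in the file. */
    public double readDouble(long offset) throws IOException {
        long windowStart = (offset / WINDOW_SIZE) * WINDOW_SIZE;
        long length = Math.min(WINDOW_SIZE, channel.size() - windowStart);
        MappedByteBuffer window = channel.map(FileChannel.MapMode.READ_ONLY, windowStart, length);
        return window.getDouble((int) (offset - windowStart));
    }
}
```

Since the window size is a multiple of 8, 8-byte-aligned doubles never straddle a window boundary; variable-size records would need extra care at the edges.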
When you have multiple clients, you could have a server process which has access to the files or database and caches/distributes data to the clients.
Just use a binary file? It's easy if your objects are equal in size: you can use random access to jump around in the file, and your operating system's disk cache gives you caching for free. Sometimes people reach for a database and SQL interface as a golden hammer.
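A small sketch of the fixed-size-record approach, using RandomAccessFile's seek() to jump to any record (the class name and the three-doubles-per-state layout are illustrative assumptions):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Fixed-size records in a plain binary file; seek() gives random access. */
public class RecordFile implements AutoCloseable {
    static final int DOUBLES_PER_STATE = 3;              // assumed record layout
    static final int RECORD_BYTES = DOUBLES_PER_STATE * Double.BYTES;

    private final RandomAccessFile file;

    public RecordFile(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    public void write(long index, double[] state) throws IOException {
        file.seek(index * RECORD_BYTES);                 // jump to record start
        for (double d : state) file.writeDouble(d);
    }

    public double[] read(long index) throws IOException {
        file.seek(index * RECORD_BYTES);
        double[] state = new double[DOUBLES_PER_STATE];
        for (int i = 0; i < DOUBLES_PER_STATE; i++) state[i] = file.readDouble();
        return state;
    }

    @Override public void close() throws IOException { file.close(); }
}
```

Because every record is the same size, record N always lives at byte `N * RECORD_BYTES`, and no index structure is needed.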
Have you looked at Berkeley DB Java Edition? It was designed with this type of use case in mind: large data sets, high write throughput, and reliable persistence, with a set of very developer-friendly Java APIs. You can use the Base API (key/value pairs), the Collections API, or the JPA-like DPL (Direct Persistence Layer) API.
There's an excellent Getting Started Guide that has examples and explains the various APIs.
There are lots of similar use cases to yours. In fact, Terracotta and Coherence both use Berkeley DB for persistence, as do Heritrix (the Internet Archive's crawler project), Tibco, and many other companies and projects. The reason is that BDB provides the performance, reliability, scalability, flexibility and simplicity that they need.
Disclaimer: I'm one of the product managers for Berkeley DB, so naturally I'm biased. But your use case sounds exactly on target with what BDB was designed to do.
Good luck with your project. Please let us know if there is anything that we can help with. You can ask questions about Berkeley DB Java Edition on the OTN Forums, where you'll find a large community of active Java application developers.
Regards,
Dave