
Creating a Key-Value Store on Disc with Concurrency in Java

I need to read a set of files and break them into key-value pairs, then save these as (key, list of values) records for each key on disc, much like the map-reduce paradigm, except that everything runs on one computer. I could, for example, write the different lists to different files and name each file after its key, but that seems like a very poor way of doing things: with a billion keys you would end up with a billion files. So obviously that is not going to work, and I will need some sort of memory mapping.

I will also have different threads doing the map job, so if they all write to the same buffer there will have to be some synchronization between them. If I have a key-value buffer mapping and synchronize over the buffers, the threads shouldn't step on each other's toes, so I think that part should work.

The question is how to do the mapping of the values to disc. How do I write buffers that correspond to different keys into the same file? If someone could point me in the right direction, it would be much appreciated. My knowledge of this area is quite pathetic. Thanks.


From a practical standpoint, it would be easy to do this with BerkeleyDB, as Lirik suggested.

If you are more interested in theory than practice, I'd suggest that you approach this as an "external sort" operation. That is, read as much input as you can into memory, then sort by key. Write the sorted chunk out as a single file. The sorted files can then be easily merged into a single file.
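
A minimal sketch of that external-sort approach, assuming one "key<TAB>value" pair per input line; the chunk size and file names are illustrative only. Phase one spills sorted runs, phase two merges them so that all lines for a key come out adjacent.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the external sort described above. Input lines are assumed to be
// "key<TAB>value"; the chunk size and file names are illustrative, not prescriptive.
public class ExternalSortSketch {

    static final int CHUNK_LINES = 1_000_000; // lines that comfortably fit in memory (assumption)

    public static void main(String[] args) throws IOException {
        List<Path> runs = writeSortedRuns(Paths.get("input.tsv"));
        mergeRuns(runs, Paths.get("sorted.tsv"));
        // In sorted.tsv all lines with the same key are now adjacent, so a single
        // sequential pass can turn them into (key, list of values) records.
    }

    // Phase 1: read as much as fits in memory, sort by key, spill a sorted "run" file.
    static List<Path> writeSortedRuns(Path input) throws IOException {
        List<Path> runs = new ArrayList<>();
        List<String> chunk = new ArrayList<>(CHUNK_LINES);
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == CHUNK_LINES) {
                    runs.add(spill(chunk));
                    chunk.clear();
                }
            }
        }
        if (!chunk.isEmpty()) runs.add(spill(chunk));
        return runs;
    }

    static Path spill(List<String> chunk) throws IOException {
        chunk.sort(Comparator.comparing((String l) -> l.substring(0, l.indexOf('\t'))));
        Path run = Files.createTempFile("run-", ".tsv");
        Files.write(run, chunk, StandardCharsets.UTF_8);
        return run;
    }

    // Phase 2: k-way merge of the sorted runs, so equal keys come out grouped together.
    static void mergeRuns(List<Path> runs, Path output) throws IOException {
        class Head { BufferedReader reader; String line; } // current line of one run
        PriorityQueue<Head> heap = new PriorityQueue<>(
                Comparator.comparing((Head h) -> h.line.substring(0, h.line.indexOf('\t'))));
        for (Path run : runs) {
            BufferedReader reader = Files.newBufferedReader(run, StandardCharsets.UTF_8);
            Head head = new Head();
            head.reader = reader;
            head.line = reader.readLine();
            if (head.line != null) heap.add(head); else reader.close();
        }
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            while (!heap.isEmpty()) {
                Head head = heap.poll();
                out.write(head.line);
                out.newLine();
                head.line = head.reader.readLine();
                if (head.line != null) heap.add(head); else head.reader.close();
            }
        }
    }
}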

Among other applications, this is the approach used by Lucene to build "inverted indexes" for searching text. The "keys" are words in documents, and the "values" are a list of documents in which the word appears. Lucene reads documents, and for each word, creates a term-to-document entry in memory. When memory is full, it writes the index segment to disk. When there are a lot of index segments on disk, they are merged into a single segment. In fact, you could also adapt Lucene's index writer to your task.
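
Along the same lines, a hypothetical sketch of the segment-writing pattern just described: accumulate key → list-of-values in memory and flush a sorted segment file whenever the buffer grows past a threshold (the threshold and the tab-separated file layout are assumptions). The segments can then be merged like the sorted runs above.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical segment writer: buffers key -> values in memory and flushes a sorted
// segment file whenever the buffer holds too many keys; segments are merged later.
public class SegmentWriterSketch {

    private final TreeMap<String, List<String>> buffer = new TreeMap<>(); // kept sorted by key
    private final int maxBufferedKeys; // flush threshold (assumption)
    private int segmentCount = 0;

    public SegmentWriterSketch(int maxBufferedKeys) {
        this.maxBufferedKeys = maxBufferedKeys;
    }

    public void add(String key, String value) throws IOException {
        buffer.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        if (buffer.size() >= maxBufferedKeys) {
            flushSegment();
        }
    }

    // Writes one sorted segment; each line is "key<TAB>value1<TAB>value2...".
    public void flushSegment() throws IOException {
        if (buffer.isEmpty()) return;
        Path segment = Paths.get("segment-" + (segmentCount++) + ".tsv");
        try (BufferedWriter out = Files.newBufferedWriter(segment, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, List<String>> e : buffer.entrySet()) {
                out.write(e.getKey() + "\t" + String.join("\t", e.getValue()));
                out.newLine();
            }
        }
        buffer.clear();
    }
}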

The work can be partitioned into multiple threads. However, you have to be sensitive to disk contention. Skipping around to read and write many files concurrently will slow a conventional drive down a lot. There may be opportunities to schedule some activities concurrently. You could probably read in new data from one file while you are writing the previous sorted chunk to disk, especially if the machine has two disk drives. Of course, using an SSD for temporary storage of some of the sorted segments would help immensely.


I think Oracle's Berkeley DB might be just the thing for you:


Berkeley DB is designed to store data as opaque byte arrays in key/value pairs, indexed by one of the available access methods.

Berkeley DB is very robust, mature, and fast, but if you want to go with a more lightweight approach then use SQLite.
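
For illustration, a minimal sketch using the Berkeley DB Java Edition API (the directory, database name, and sample data are assumptions); setSortedDuplicates lets one key hold several values, which matches the (key, list of values) requirement.

import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

import java.io.File;
import java.nio.charset.StandardCharsets;

public class BerkeleyDbSketch {
    public static void main(String[] args) throws Exception {
        // Open (or create) an on-disk environment in ./bdb-store (hypothetical directory).
        File dir = new File("bdb-store");
        dir.mkdirs();
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(dir, envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setSortedDuplicates(true); // one key may map to several values
        Database db = env.openDatabase(null, "kv", dbConfig);

        // All keys and values end up in the same set of files managed by the environment.
        DatabaseEntry key = new DatabaseEntry("word".getBytes(StandardCharsets.UTF_8));
        db.put(null, key, new DatabaseEntry("doc-1".getBytes(StandardCharsets.UTF_8)));
        db.put(null, key, new DatabaseEntry("doc-42".getBytes(StandardCharsets.UTF_8)));

        db.close();
        env.close();
    }
}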

Another option is to use Google's LevelDB; it's written in C++ but there are Java wrappers around it. LevelDB is mind-numbingly fast and very lightweight!
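
As a rough sketch, assuming the leveldbjni wrapper (which exposes the org.iq80.leveldb API), basic usage looks like this; the directory name and sample data are illustrative.

import static org.fusesource.leveldbjni.JniDBFactory.*;

import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

import java.io.File;

public class LevelDbSketch {
    public static void main(String[] args) throws Exception {
        Options options = new Options();
        options.createIfMissing(true);
        // One LevelDB database is one directory of files holding all key/value pairs.
        DB db = factory.open(new File("leveldb-store"), options);
        try {
            db.put(bytes("word"), bytes("doc-1,doc-42")); // store a (serialized) value list
            System.out.println(asString(db.get(bytes("word"))));
        } finally {
            db.close();
        }
    }
}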

Without having any more details on your project, I can only say:

  • With all of these solutions the key/value pairs will be stored in the same file (multiple instances can store to separate files if necessary, but I don't see why that would be needed here).
  • BerkeleyDB and LevelDB have really good caching and mapping capabilities.
  • BDB and LDB also allow for compression (not sure if SQLite does too).
  • Depending on your key distribution (i.e. perhaps if you use a good hashing function like Google's CityHash), you may achieve really good data locality so you reduce table scans.
  • You should probably write your own thread-safe buffer(s), and you should avoid having multiple threads write to BDB/LDB, since these solutions are disk-based and you generally don't want multi-threaded disk I/O operations (see the sketch after this list).
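
A hedged sketch of that single-writer idea (the queue capacity, thread count, and pair representation are assumptions, not part of any of the libraries above): mapper threads put key/value pairs onto a bounded queue, and one dedicated thread drains the queue and performs all the disk writes.

import java.util.AbstractMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the single-writer pattern: many mapper threads produce pairs,
// exactly one thread performs the disk writes.
public class SingleWriterSketch {

    // Sentinel entry that tells the writer thread to stop.
    private static final Map.Entry<String, String> POISON =
            new AbstractMap.SimpleEntry<>(null, null);

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Map.Entry<String, String>> queue = new ArrayBlockingQueue<>(10_000);

        // The only thread that touches the disk / the key-value store.
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    Map.Entry<String, String> pair = queue.take();
                    if (pair == POISON) break;
                    // write pair.getKey() / pair.getValue() to BDB, LevelDB, or a file here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        // Mapper threads parse input and emit pairs; put() blocks if the writer falls
        // behind, which naturally throttles the mappers.
        ExecutorService mappers = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            mappers.submit(() -> {
                try {
                    queue.put(new AbstractMap.SimpleEntry<>("someKey", "someValue"));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        mappers.shutdown();
        mappers.awaitTermination(1, TimeUnit.MINUTES);

        queue.put(POISON); // all mappers are done; let the writer drain the queue and exit
        writer.join();
    }
}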

Critique:

  • I'm not sure what you mean by "key-value buffer mapping"... are you mapping a buffer to each key? Why do you need that?


Chronicle Map should be a good solution for this problem.

It is generally very efficient in terms of both operation speed and memory consumption; in particular, it is much faster than BerkeleyDB, suggested above.

Chronicle Map uses segmented storage and allows parallel processing of the segments, e.g.:

for (int i = 0; i < chronicleMap.segments(); i++) {
  int segmentIndex = i;
  executor.submit(() -> {
    chronicleMap.segmentContext(segmentIndex).forEachSegmentEntry(entry -> {
      // do processing with entry.key() and entry.value(),
      // value() could be a List or some Iterator-like abstraction
    });
  });
}

See MapSegmentContext Javadocs.
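
The chronicleMap and executor in the snippet above are assumed to already exist; a hedged sketch of creating a map persisted to a single file (the types, sizes, and file name are illustrative assumptions) might look like:

import net.openhft.chronicle.map.ChronicleMap;
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative setup for the loop above: a ChronicleMap persisted to one file on disk,
// plus the executor that runs one task per segment.
ChronicleMap<CharSequence, CharSequence> chronicleMap = ChronicleMap
    .of(CharSequence.class, CharSequence.class)
    .name("kv-store")
    .entries(1_000_000)              // expected number of keys (assumption)
    .averageKey("some-key")          // sizing hints needed for variable-length keys/values
    .averageValue("value1,value2,value3")
    .createPersistedTo(new File("kv-store.dat"));
ExecutorService executor = Executors.newFixedThreadPool(chronicleMap.segments());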

However, having (logically) multiple values per key cannot always be handled efficiently with Chronicle Map. In your case, though, if you only need to process a static set of values for each key, without adding or removing values, it could work well.


Have you looked at using Hadoop?
