Java: multithreaded character stream decoding
I am maintaining a high-performance CSV parser and am trying to get the most out of the latest technology to improve throughput. For this particular task that means:
- Flash memory (we own a relatively inexpensive PCI-Express card with 1 TB of storage that reaches 1 GB/s sustained read performance)
- Multiple cores (we own a cheap Nehalem server with 16 hardware threads)
The first implementation of the CSV parser was single-threaded: file reading, character decoding, field splitting and text parsing all ran in the same thread. The result was a throughput of about 50 MB/s. Not bad, but well below the storage limit...
The second implementation uses one thread to read the file (at the byte level), one thread to decode the characters (from ByteBuffer to CharBuffer), and multiple threads to parse the fields (I mean parsing delimited text fields into doubles, integers, dates...). This works much faster, close to 400 MB/s on our box.
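In outline, the pipeline looks something like the sketch below (class name, buffer sizes and queue sizes are illustrative only, not the real code):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    private static final ByteBuffer BYTE_POISON = ByteBuffer.allocate(0);

    public static void main(String[] args) throws Exception {
        BlockingQueue<ByteBuffer> rawChunks = new ArrayBlockingQueue<>(8);   // illustrative size
        BlockingQueue<CharBuffer> charChunks = new ArrayBlockingQueue<>(8);  // illustrative size

        // Reader thread: pulls big chunks off the file at the byte level.
        Thread reader = new Thread(() -> {
            try (FileChannel ch = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
                while (true) {
                    ByteBuffer buf = ByteBuffer.allocateDirect(8 * 1024 * 1024);
                    if (ch.read(buf) < 0) break;
                    buf.flip();
                    rawChunks.put(buf);
                }
                rawChunks.put(BYTE_POISON);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });

        // Decoder thread: the single-threaded stage that turns out to be the bottleneck.
        Thread decoder = new Thread(() -> {
            CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE);
            try {
                while (true) {
                    ByteBuffer bytes = rawChunks.take();
                    if (bytes == BYTE_POISON) break;
                    CharBuffer chars = CharBuffer.allocate(bytes.remaining());
                    dec.decode(bytes, chars, false); // leftover bytes of a split sequence
                    chars.flip();                    // would have to be carried to the next chunk
                    charChunks.put(chars);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        decoder.start();
        // ...parser worker threads take() from charChunks and split/convert the fields...
    }
}
```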
But that is still well below the performance of our storage, and those SSDs will only get faster in the future; we are not getting the most out of them in Java. It is clear that the current limitation is the character decoding (CharsetDecoder.read(...)). That is the bottleneck: on a powerful Nehalem processor it transforms bytes into chars at 400 MB/s, which is pretty good, but it has to be single-threaded. The CharsetDecoder is somewhat stateful, depending on the charset used, and does not support multithreaded decoding.
So my question to the community is (and thank you for reading the post so far): does anyone know how to parallelize the charset decoding operation in Java?
> does anyone know how to parallelize the charset decoding operation in Java?
You might be able to open multiple input streams to do this (I'm not sure how you'd go about this with NIO, but it must be possible).
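One NIO-based possibility is to share a single FileChannel and use its positional read(dst, position), which is documented as safe for concurrent use by multiple threads. A minimal, untested sketch (the file path and the two-way split are placeholders):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class RegionReader {
    // One positional read; FileChannel.read(dst, position) does not touch the
    // channel's own position, so several threads can share a single open channel.
    // (A real reader would loop until the buffer is full.)
    static ByteBuffer readRegion(FileChannel ch, long offset, int length) throws Exception {
        ByteBuffer buf = ByteBuffer.allocate(length);
        ch.read(buf, offset);
        buf.flip();
        return buf;
    }

    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            long size = ch.size();
            long half = size / 2; // naive split; ignores character boundaries entirely
            ByteBuffer first  = readRegion(ch, 0, (int) half);            // int casts assume
            ByteBuffer second = readRegion(ch, half, (int) (size - half)); // a small demo file
            // each region could now be handed to its own decoder thread
        }
    }
}
```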
How difficult this would be depends on the encoding you're decoding from; you will need a bespoke solution for each source encoding. If the encoding has a fixed width (e.g. Windows-1252), then one byte == one character and decoding is easy.
Modern variable-width encodings (like UTF-8 and UTF-16) contain rules for identifying the first byte of a character sequence, so it is possible to jump to the middle of a file and start decoding (you'll have to note the end of the previous chunk, so it is wise to start decoding the end of the file first).
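For UTF-8 in particular, the boundary test is a single bit mask, since continuation bytes always have the form 10xxxxxx. A minimal sketch (the helper name is made up):

```java
// Minimal sketch: given a chunk of bytes that starts at an arbitrary file offset,
// skip forward to the first byte that begins a UTF-8 character.
// UTF-8 continuation bytes always match the bit pattern 10xxxxxx.
static int firstCharBoundary(byte[] chunk) {
    int i = 0;
    while (i < chunk.length && (chunk[i] & 0xC0) == 0x80) {
        i++; // still inside a multi-byte sequence that started in the previous chunk
    }
    return i; // index of the first lead byte (or chunk.length if the chunk has none)
}
```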
Some legacy variable-width encodings might not be this well designed, so you'll have no option but to decode from the start of the data and read it sequentially.
If it is an option, generate your data as UTF-16BE. Then you can cut out decoding and read two bytes straight to a char.
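A minimal sketch of that idea using java.nio (class and method names are made up):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

public class Utf16beView {
    // View UTF-16BE bytes directly as chars with no CharsetDecoder involved.
    // ByteBuffer is big-endian by default, so asCharBuffer() pairs bytes exactly
    // as UTF-16BE does. A leading BOM (0xFE 0xFF) would still need to be skipped.
    static CharBuffer asChars(byte[] utf16beBytes) {
        return ByteBuffer.wrap(utf16beBytes).asCharBuffer();
    }
    // Equivalent by hand: char c = (char) (((hi & 0xFF) << 8) | (lo & 0xFF));
}
```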
If the file is Unicode, watch out for BOM handling, but I'm guessing you're already familiar with many of the low-level details.
> It is clear that the current limitation is the character decoding (CharsetDecoder.read(...))
How do you know that? Does your monitoring / profiling show conclusively that the decoder thread is using 100% of one of your cores?
Another possibility is that the OS is not capable of driving the SSD at its theoretical maximum speed.
If UTF-8 decoding is definitely the bottleneck then it should be possible to do the task in parallel. But you will certainly need to implement your own decoders to do this.
Another (crazy) alternative would be to just separate the input into chunks of some arbitrary size, ignore the decoding issues and then decode each of the chunks in parallel. However, you want to ensure that the chunks overlap (with a parametrized size). If the overlapping region of the two chunks is decoded the same way by the two threads (and your overlap was big enough for the specified encoding) it should be safe to join the results. The bigger the overlap, the more processing required, and the smaller the probability of error. Furthermore, if you are in a situation where you know the encoding is UTF-8, or a similarly simple encoding, you could set the overlap quite low (for that client) and still be guaranteed correct operation.
If the second chunk turns out to be wrong, you will have to redo it, so it is important not to make the chunks that run in parallel too big. If you do more than two chunks in parallel, it is important to 'repair' from beginning to end, so that one misaligned block does not end up invalidating the next block (which might be correctly aligned).
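A rough, untested sketch of this overlapping-chunk approach, with arbitrary chunk and overlap sizes and the join/verify step left as a comment:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OverlappingDecode {
    static final int CHUNK = 8 * 1024 * 1024; // arbitrary
    static final int OVERLAP = 64;            // a few bytes suffice for UTF-8

    // 'whole' is assumed flipped: position 0, limit = number of bytes to decode.
    static List<CharBuffer> decodeParallel(ByteBuffer whole, Charset cs, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<CharBuffer>> futures = new ArrayList<>();
        for (int start = 0; start < whole.limit(); start += CHUNK) {
            int end = Math.min(whole.limit(), start + CHUNK + OVERLAP);
            ByteBuffer view = whole.duplicate();
            view.position(start);
            view.limit(end);
            ByteBuffer piece = view.slice();
            // each task gets its own decoder: CharsetDecoder is stateful and not thread-safe
            Callable<CharBuffer> task = () -> cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .decode(piece);
            futures.add(pool.submit(task));
        }
        List<CharBuffer> decoded = new ArrayList<>();
        for (Future<CharBuffer> f : futures) {
            decoded.add(f.get());
        }
        pool.shutdown();
        // The join/verify step is omitted: compare the decoded overlap of neighbouring
        // pieces and, if they disagree, re-decode the later chunk from a corrected
        // offset, repairing from beginning to end as described above.
        return decoded;
    }
}
```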
If you know the encoding, and it is either fixed-width or does not contain overlapping byte sequences, you could scan for a special sequence. In CSV, a newline sequence makes sense as a split point. Even if you detect the encoding dynamically, you could run a pass over the first few bytes to determine the encoding, and then move on to parallel decoding.
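A minimal sketch of that newline scan, assuming UTF-8 or another ASCII-compatible encoding (the helper name is made up):

```java
// Minimal sketch: find the start of the next CSV record at or after 'from'.
// Valid for UTF-8 and other ASCII-compatible encodings, where the byte 0x0A
// never occurs inside a multi-byte sequence. Quoted fields that contain
// embedded newlines would still need extra handling.
static int nextRecordStart(byte[] data, int from) {
    for (int i = from; i < data.length; i++) {
        if (data[i] == '\n') {
            return i + 1; // the record starts just after the newline
        }
    }
    return data.length;   // no newline in this chunk
}
```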