
Fast alternative to java.nio.charset.Charset.decode(..)/encode(..)

Does anybody know a faster way to do what java.nio.charset.Charset.decode(..)/encode(..) does?

It's currently one of the bottlenecks of a technology that I'm using.

[EDIT] Specifically, in my application, I changed one segment from a Java solution to a JNI solution (because there was a C++ technology that was more suitable for my needs than the Java technology I was using).

This change brought a significant decrease in speed (and a significant increase in CPU and memory usage).

Looking deeper into the JNI solution that I used, the Java application communicates with the C++ application via byte[]. These byte[] are produced by Charset.encode(..) on the Java side and passed to the C++ side. Then, when the C++ side responds with a byte[], it gets decoded on the Java side via Charset.decode(..).

Running this against a profiler, I see that Charset.decode(..) and Charset.encode(..) both took a significantly long time compared to the whole execution time of the JNI solution (I profiled only the JNI solution because it's something I could whip up quite quickly. I'll profile the whole application at a later date once I free up my schedule :-) ).

Upon reading further about my problem, it seems that this is a known problem with Charset.encode(..) and decode(..), and that it is being addressed in Java 7. However, moving to Java 7 is not an option for me (for now) due to some constraints.

Which is why I am asking here whether somebody knows a Java 5 solution or alternative (sorry, I should have mentioned sooner that this is for Java 5). :-)


The javadoc for encode() and decode() makes it clear that these are convenience methods. For example, for encode():

Convenience method that encodes Unicode characters into bytes in this charset.

An invocation of this method upon a charset cs returns the same result as the expression

 cs.newEncoder()
   .onMalformedInput(CodingErrorAction.REPLACE)
   .onUnmappableCharacter(CodingErrorAction.REPLACE)
   .encode(bb); 

except that it is potentially more efficient because it can cache encoders between successive invocations.

The language is a bit vague there, but you might get a performance boost by not using these convenience methods. Create and configure the encoder once, and then re-use it:

 CharsetEncoder encoder = cs.newEncoder()
   .onMalformedInput(CodingErrorAction.REPLACE)
   .onUnmappableCharacter(CodingErrorAction.REPLACE);

 encoder.encode(...);
 encoder.encode(...);
 encoder.encode(...);
 encoder.encode(...);

It always pays to read the javadoc, even if you think you already know the answer.
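
The reuse pattern above could look roughly like the following. This is only a minimal sketch: the class name ReusableEncoder and its encode(String) method are made up for illustration, and whether it actually beats Charset.encode(..) on your JVM is something to measure rather than assume. It sticks to APIs available in Java 5.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper (not part of the JDK): creates and configures one
// encoder up front and reuses it for every call.
// Not thread-safe: use one instance per thread.
final class ReusableEncoder {
    private final CharsetEncoder encoder;

    ReusableEncoder(String charsetName) {
        encoder = Charset.forName(charsetName).newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
    }

    byte[] encode(String text) {
        try {
            // encode(CharBuffer) performs a complete operation:
            // it resets the encoder, encodes, and flushes.
            ByteBuffer bb = encoder.encode(CharBuffer.wrap(text));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE actions; wrapped to keep the sketch simple.
            throw new IllegalStateException(e);
        }
    }
}
```

Usage would be along the lines of creating `new ReusableEncoder("UTF-8")` once and calling `encode(..)` on it for each message. Note that encode(CharBuffer) still allocates a new ByteBuffer per call; only the encoder setup is saved.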


First part: it is a bad idea in general to pass arrays into JNI code. Because of GC, the JVM may have to copy arrays. In the worst case the array will be copied twice: once on the way into the JNI code and once on the way back :)

Because of that, the Buffer class hierarchy was introduced. And of course the Java dev team created a nice way to encode/decode chars:

Charset#newDecoder returns a CharsetDecoder, which can be used to convert a ByteBuffer to a CharBuffer according to a Charset. There are two main method versions:

CoderResult decode(ByteBuffer in, CharBuffer out, boolean endOfInput)
CharBuffer decode(ByteBuffer in)

For maximum performance you need the first one. It has no hidden memory allocations inside.

Note that an encoder/decoder can maintain internal state, so be careful (for example, if you decode from a 2-byte encoding and the input buffer ends with only one half of a char...). Also, encoders/decoders are not thread-safe.
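
As a rough illustration of the three-argument decode(..) described above, here is a minimal sketch. The class StreamingDecoder and its method names are invented for this example; it reuses one CharsetDecoder and one CharBuffer across calls and handles the overflow case by draining the buffer in a loop. The StringBuilder is there only to make the result easy to inspect, so a real hot path would probably consume the CharBuffer directly instead. The input can be a direct ByteBuffer (ByteBuffer.allocateDirect), which is the kind of buffer native code can access through JNI's GetDirectBufferAddress.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper (not part of the JDK). Not thread-safe.
final class StreamingDecoder {
    private final CharsetDecoder decoder;
    private final CharBuffer chars; // reused output buffer

    StreamingDecoder(String charsetName, int charCapacity) {
        if (charCapacity < 2) {
            // A supplementary character needs room for a surrogate pair.
            throw new IllegalArgumentException("charCapacity must be >= 2");
        }
        decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        chars = CharBuffer.allocate(charCapacity);
    }

    // Decodes all remaining bytes of 'in' as one complete input.
    String decode(ByteBuffer in) {
        StringBuilder sb = new StringBuilder();
        decoder.reset(); // drop any state left over from a previous call
        CoderResult result;
        do {
            // endOfInput = true: 'in' holds the whole message
            result = decoder.decode(in, chars, true);
            drain(sb);
        } while (result.isOverflow());
        do {
            result = decoder.flush(chars);
            drain(sb);
        } while (result.isOverflow());
        return sb.toString();
    }

    // Moves the decoded chars into sb and empties the reusable buffer.
    private void drain(StringBuilder sb) {
        chars.flip();
        sb.append(chars);
        chars.clear();
    }
}
```

If the input arrives in several chunks, the same decoder would be called with endOfInput set to false for all but the last chunk, and any undecoded trailing bytes would have to be carried over to the next call; that case is left out of this sketch.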


There are very few reasons to "squeeze" a string into a byte array. I would recommend writing the C functions to take UTF-16 strings as parameters. This way there is no need for any conversion.
