Why are the lengths different when converting a byte array to a String and then back to a byte array?

Asked 2023-02-11 05:10

I have the following Java code:

byte[] signatureBytes = getSignature();

String signatureString = new String(signatureBytes, "UTF8");
byte[] signatureStringBytes = signatureString.getBytes("UTF8");

System.out.println(signatureBytes.length == signatureStringBytes.length); // prints false

Q: I'm probably misunderstanding this, but I thought that new String(byte[] bytes, String charset) and String.getBytes(charset) are inverse operations?

Q: As a follow-up, what is a safe way to transport a byte[] array as a String?

Not every byte[] is valid UTF-8. By default, invalid sequences get replaced by a fixed replacement character, and I think that's the reason for the length change.
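To make the length change concrete, here is a minimal sketch. The class name Utf8RoundTrip and the sample bytes are made up for illustration; they are not from the question. It relies on Java's UTF-8 decoder substituting U+FFFD for malformed input, and U+FFFD takes three bytes when encoded back to UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // Decode the bytes as UTF-8, then encode the resulting String back to UTF-8.
    static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF never occurs in well-formed UTF-8; the other two bytes are plain ASCII.
        byte[] original = { (byte) 0xFF, 'o', 'k' };
        byte[] result = roundTrip(original);
        // The invalid byte becomes U+FFFD, which encodes as EF BF BD.
        System.out.println(original.length + " -> " + result.length); // prints 3 -> 5
    }
}
```

A signature is effectively random binary data, so it is very likely to contain at least one such sequence.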

Try Latin-1 (ISO-8859-1); it should not happen there, since it is a simple encoding in which every byte[] is meaningful.

It should not happen with Windows-1252 either. There are undefined sequences there (in fact, undefined bytes), but every char is encoded as a single byte. The new byte[] may differ from the original one, but their lengths must be the same.
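The Latin-1 claim can be checked with a short sketch like the following. The class name Latin1RoundTrip is hypothetical. It runs all 256 possible byte values through an ISO-8859-1 round trip and compares the result with the input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    // Decode the bytes as ISO-8859-1, then encode the String back to ISO-8859-1.
    static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.ISO_8859_1);
        return decoded.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Build an array holding every possible byte value once.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        // ISO-8859-1 maps byte n to code point U+00nn, so nothing is lost.
        System.out.println(Arrays.equals(all, roundTrip(all))); // prints true
    }
}
```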

I'm probably misunderstanding this, but I thought that new String(byte[] bytes, String charset) and String.getBytes(charset) are inverse operations?

Not necessarily.

If the input byte array contains sequences that are not valid UTF-8, then the initial conversion replaces them with a substitute character (a question mark, for example; Java's UTF-8 decoder uses U+FFFD). The second operation then encodes that substitute character as UTF-8, which is different from the original bytes and may differ in length.

It is true that some characters in Unicode have multiple representations; e.g. an accented character can be a single codepoint, or a base character codepoint followed by an accent codepoint. However, converting back and forth between a byte array (containing valid UTF-8) and a String should preserve the codepoint sequences. It doesn't perform any "normalization".
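As a quick check of the point about normalization, here is a small sketch. The class name NoNormalization is made up for illustration. It round-trips both the precomposed and the decomposed form of "é" through UTF-8 and confirms that each form comes back unchanged.

```java
import java.nio.charset.StandardCharsets;

public class NoNormalization {
    // Encode a String to UTF-8 and decode it again.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // single code point: e with acute accent
        String decomposed = "e\u0301"; // letter e followed by a combining acute accent

        System.out.println(roundTrip(composed).equals(composed));     // prints true
        System.out.println(roundTrip(decomposed).equals(decomposed)); // prints true
        // The two forms stay distinct; the round trip never merges them.
        System.out.println(composed.equals(decomposed));              // prints false
    }
}
```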

So what would be a safe way to transport a byte[] array as String then?

The safest alternative would be to Base64-encode the byte array. This has the added advantage that the characters in the String will survive conversion into any character set / encoding that can represent Latin letters and digits.

Another alternative is to use Latin-1 instead of UTF-8. However:

There is a risk of damage if the data gets (for example) mistakenly interpreted as UTF-8.
This approach is not legal if the "string" is then embedded in XML. Many control characters are outside of the XML character set, and cannot be used in an XML document, even encoded as character entities.

Two possibilities come to mind.

First, your signature may not be entirely valid UTF-8. You can't just take any arbitrary binary data and "string" it. Not every clump of bits defines a legal character. The String constructor will insert some default replacement content for binary data that doesn't actually 'mean' anything in UTF-8. This is not a reversible process. If you want to "String" some arbitrary binary data, you need to use an established method for doing so; I would suggest org.apache.commons.codec.binary.Base64.
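If you want to find out whether a given byte[] is valid UTF-8 instead of having it silently replaced, one option is a CharsetDecoder configured to report errors. This is a sketch, not part of the original answer; the class name StrictUtf8 and the helper isValidUtf8 are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    // Returns true only if the whole array decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of replacing
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(new byte[] { 'o', 'k' }));              // prints true
        System.out.println(isValidUtf8(new byte[] { (byte) 0xC3, (byte) 0x28 })); // prints false
    }
}
```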

Second, some characters have more than one representation. For example, an accented letter can be encoded as a single accented character or as the base character followed by a combining accent. There's no guarantee that this is a reversible process when moving back and forth between encodings.

I wanted to store the data in my JSP page as a String, then send that String as a parameter to the server side and convert it back to byte[]. This worked for me:

To convert a byte[] to String

String byteToString = Base64.getEncoder().encodeToString(byteContent);

To convert from String to byte[]

byte[] stringToByte = Base64.getDecoder().decode(stringContent);

And this returns the exact byte[] with the same length.
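Putting the two calls together, a complete round trip might look like the sketch below. The class and method names are made up, and using the URL-safe encoder without padding is my assumption: it avoids '+', '/' and '=', which would otherwise need escaping in a request parameter. The plain Base64.getEncoder() / Base64.getDecoder() pair shown above works the same way.

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {
    // Encode arbitrary bytes using only letters, digits, '-' and '_'.
    static String toUrlSafeString(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    // Decode text produced by toUrlSafeString back into the original bytes.
    static byte[] fromUrlSafeString(String text) {
        return Base64.getUrlDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Sample bytes that are not valid UTF-8, standing in for a signature.
        byte[] original = { (byte) 0xFF, (byte) 0xFE, 0x00, (byte) 0x80, 0x3F };
        String text = toUrlSafeString(original);
        byte[] restored = fromUrlSafeString(text);
        System.out.println(Arrays.equals(original, restored)); // prints true
    }
}
```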
