"Fix" String encoding in Java
I have a String
created from a byte[]
array, using UTF-8 encoding. However, the array actually contained Windows-1252 encoded data.
Is there a way to convert this String back to the right encoding?
I know it's easy to do if you have access to the original byte array, but in my case it's too late because it's given by a closed source library.
As there seems to be some confusion on whether this is possible or not I think I'll need to provide an extensive example.
The question claims that the (initial) input is a byte[]
that contains Windows-1252 encoded data. I'll call that byte[]
ib
(for "initial bytes").
For this example I'll choose the German word "Bär" (meaning bear) as the input:
byte[] ib = new byte[] { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
String correctString = new String(ib, "Windows-1252");
assert correctString.charAt(1) == '\u00E4'; //verify that the character was correctly decoded.
(If your JVM doesn't support that encoding, then you can use ISO-8859-1 instead, because those three letters (and most others) are at the same position in those two encodings).
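The fallback just mentioned could be sketched like this. The class and method names are made up for the illustration; Charset.isSupported and StandardCharsets.ISO_8859_1 are standard JDK API, and ISO-8859-1 is one of the charsets every JVM is required to provide.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SingleByteCharset {
    // Prefer windows-1252; fall back to ISO-8859-1 if the JVM lacks it.
    static Charset pick() {
        return Charset.isSupported("windows-1252")
                ? Charset.forName("windows-1252")
                : StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        byte[] ib = { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
        String correctString = new String(ib, pick());
        // 0xE4 sits at the same position in both charsets (U+00E4, a-umlaut).
        System.out.println(correctString.charAt(1) == '\u00E4');
    }
}
```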
The question goes on to state that some other code (that is outside of our influence) already converted that byte[]
to a String using the UTF-8 encoding (I'll call that String
is
for "input String"). That String
is the only input that is available to achieve our goal (if ib
were available, it would be trivial):
String is = new String(ib, "UTF-8");
System.out.println(is);
This obviously produces the incorrect output "B�".
The goal would be to produce ib
(or the correct decoding of that byte[]
) with only is
available.
Now some people claim that getting the UTF-8 encoded bytes from that is
will return an array with the same values as the initial array:
byte[] utf8Again = is.getBytes("UTF-8");
But that returns the UTF-8 encoding of the two characters B
and �
and definitely returns the wrong result when re-interpreted as Windows-1252:
System.out.println(new String(utf8Again, "Windows-1252"));
This line produces the output "Bï¿½" (the three UTF-8 bytes of the replacement character, each read as one Windows-1252 character), which is totally wrong (it is also the same output that would be the result if the initial array contained the non-word "Bür" instead).
So in this case you can't undo the operation, because some information was lost.
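To make the loss concrete, here is a small sketch (the class and method names are hypothetical) that decodes the Windows-1252 bytes of "Bär" and "Bür" as UTF-8. In both cases the invalid byte is replaced by U+FFFD, so the two results cannot be told apart afterwards, and re-encoding does not bring the initial bytes back.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LossDemo {
    // Decodes bytes the way the closed-source library (wrongly) did.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] baer = { 0x42, (byte) 0xE4, 0x72 }; // B, a-umlaut, r in Windows-1252
        byte[] buer = { 0x42, (byte) 0xFC, 0x72 }; // B, u-umlaut, r in Windows-1252
        String a = decodeAsUtf8(baer);
        String b = decodeAsUtf8(buer);
        // Different inputs, identical (damaged) strings.
        System.out.println(a.equals(b));
        // Re-encoding as UTF-8 does not reproduce the initial bytes.
        System.out.println(Arrays.equals(a.getBytes(StandardCharsets.UTF_8), baer));
    }
}
```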
There are in fact cases where such mis-encodings can be undone. It's more likely to work when all possible (or at least all occurring) byte sequences are valid in the encoding that was wrongly used for decoding. Since UTF-8 has several byte sequences that are simply not valid, you will have problems.
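As an illustration of a case that can be undone, consider a wrong decode with ISO-8859-1 instead of UTF-8. ISO-8859-1 assigns a character to every one of the 256 byte values, so getBytes gives back exactly the original bytes. This is only a sketch; the class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Latin1RoundTrip {
    // Undo a wrong ISO-8859-1 decode: recover the bytes, then decode them properly.
    static String redecode(String wronglyDecoded, Charset actual) {
        byte[] original = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, actual);
    }

    public static void main(String[] args) {
        // UTF-8 bytes of B, a-umlaut, r ...
        byte[] utf8 = "B\u00E4r".getBytes(StandardCharsets.UTF_8);
        // ... wrongly decoded as ISO-8859-1 (four characters instead of three).
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        String fixed = redecode(wrong, StandardCharsets.UTF_8);
        System.out.println(wrong.length() + " -> " + fixed.length());
    }
}
```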
I tried this and it worked for some reason
Code to repair encoding problem (it doesn't work perfectly, which we will see shortly):
final Charset fromCharset = Charset.forName("windows-1252");
final Charset toCharset = Charset.forName("UTF-8");
String fixed = new String(input.getBytes(fromCharset), toCharset);
System.out.println(input);
System.out.println(fixed);
The results are:
input: â€¦Und ich beweg mich (aber heut nur langsam)
fixed: …Und ich beweg mich (aber heut nur langsam)
Here's another example:
input: Waun da wuan ned wa (feat. Wolfgang KÃ¼hn)
fixed: Waun da wuan ned wa (feat. Wolfgang Kühn)
Here's what is happening and why the trick above seems to work:
- The original file was a UTF-8 encoded text file (comma delimited)
- That file was imported with Excel BUT the user mistakenly entered Windows 1252 for the encoding (which was probably the default encoding on his or her computer)
- The user thought the import was successful because all of the characters in the ASCII range looked okay.
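The three steps above can be simulated in a few lines. This is only a sketch of the mis-import (Excel itself is not involved, and the class and method names are made up): the text is encoded as UTF-8 and then read back as Windows-1252. Non-ASCII characters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encode as UTF-8 (the file), then decode as windows-1252 (the wrong import setting).
    static String misread(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // "K\u00FChn" is the name with u-umlaut; the umlaut becomes two characters.
        System.out.println(misread("K\u00FChn"));
        // ASCII-only text is unaffected, which is why the import looked fine.
        System.out.println(misread("Wolfgang"));
    }
}
```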
Now, when we try to "reverse" the process, here is what happens:
// we start with this garbage, two characters we don't want!
String input = "Ã¼";
final Charset cp1252 = Charset.forName("windows-1252");
final Charset utf8 = Charset.forName("UTF-8");
// let's convert it to bytes in windows-1252:
// this gives you 2 bytes: c3 bc
// "Ã" ==> c3
// "¼" ==> bc
byte[] windows1252Bytes = input.getBytes(cp1252);
// but in utf-8, c3 bc is "ü"
String fixed = new String(windows1252Bytes, utf8);
System.out.println(input);
System.out.println(fixed);
The encoding fixing code above kind of works but fails for the following characters:
(Assuming the only characters used are 1-byte characters from Windows-1252):
char | utf-8 bytes | string decoded as cp1252 | re-encoded as cp1252 bytes
”    | e2 80 9d    | â€�                      | e2 80 3f
Á    | c3 81       | Ã�                       | c3 3f
Í    | c3 8d       | Ã�                       | c3 3f
Ï    | c3 8f       | Ã�                       | c3 3f
Ð    | c3 90       | Ã�                       | c3 3f
Ý    | c3 9d       | Ã�                       | c3 3f
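The failures in this table come from the bytes 81, 8d, 8f, 90 and 9d, which have no character assigned in Windows-1252. Java decodes them to the replacement character U+FFFD, and encoding U+FFFD back to Windows-1252 yields "?" (3f). The sketch below reproduces one failing and one working row; the class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class UndefinedBytesDemo {
    // UTF-8 bytes -> wrongly decoded as windows-1252 -> re-encoded as windows-1252.
    static byte[] roundTrip(String original) {
        Charset cp1252 = Charset.forName("windows-1252");
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8), cp1252);
        return garbled.getBytes(cp1252);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex(roundTrip("\u00C1"))); // A-acute: second byte is lost
        System.out.println(hex(roundTrip("\u00FC"))); // u-umlaut: both bytes survive
    }
}
```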
It does work for some of the characters, e.g. these:
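char | utf-8 bytes | string decoded as cp1252 | re-encoded as cp1252 bytes | fixed
Þ    | c3 9e       | Ãž                       | c3 9e                      | Þ
ß    | c3 9f       | ÃŸ                       | c3 9f                      | ß
à    | c3 a0       | Ã (+ NBSP)               | c3 a0                      | à
á    | c3 a1       | Ã¡                       | c3 a1                      | á
â    | c3 a2       | Ã¢                       | c3 a2                      | â
ã    | c3 a3       | Ã£                       | c3 a3                      | ã
ä    | c3 a4       | Ã¤                       | c3 a4                      | ä
å    | c3 a5       | Ã¥                       | c3 a5                      | å
æ    | c3 a6       | Ã¦                       | c3 a6                      | æ
ç    | c3 a7       | Ã§                       | c3 a7                      | ç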
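Since the repair silently produces wrong characters for the failing rows, it may be worth detecting those cases instead of guessing. The following is a minimal sketch, not part of the original code: it uses CharsetEncoder and CharsetDecoder with CodingErrorAction.REPORT so that any unmappable or malformed data results in an empty Optional. The class name StrictRepair and the method repair are made up for the illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class StrictRepair {
    // Returns the repaired text, or empty if the input cannot be repaired losslessly.
    static Optional<String> repair(String garbled) {
        Charset cp1252 = Charset.forName("windows-1252");
        try {
            // Fails on characters that windows-1252 cannot encode, e.g. U+FFFD.
            ByteBuffer bytes = cp1252.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(garbled));
            // Fails if the recovered bytes are not valid UTF-8.
            CharBuffer fixed = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes);
            return Optional.of(fixed.toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(repair("\u00C3\u00BC").isPresent()); // A-tilde + 1/4 sign: repairable
        System.out.println(repair("\u00C3\uFFFD").isPresent()); // contains U+FFFD: not repairable
    }
}
```

A caller can then keep the original string whenever repair returns empty, rather than storing text with "?" in it.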
NOTE - I originally thought this was relevant to your question (and as I was working on the same thing myself I figured I'd share what I've learned), but it seems my problem was slightly different. Maybe this will help someone else.
What you want to do is impossible. Once you have a Java String, the information about the byte array is lost. You may have luck doing a "manual conversion". Create a list of all windows-1252 characters and their mapping to UTF-8. Then iterate over all characters in the string to convert them to the right encoding.
Edit: As a commenter said, this won't work. When you convert a Windows-1252 byte array as if it were UTF-8, you are bound to get encoding exceptions. (See here and here).
You can use this tutorial
The charset you need should be defined in rt.jar (according to this)