How to parse a string that is in a different encoding from Java's

I have a string that I have read in from a Word document. I think it is in "Cp1252" encoding. Java uses UTF8.

How do I search that string for those special characters in Cp1252 and replace them with an appropriate UTF8 character?

Specifically, I want to replace the "En Dash" character with a plain "-".

The following code block takes projDateString, which comes from the Word document, and tries to do that:

    byte[] test = projDateString.getBytes("Cp1252");
    for (int i = 0; i < test.length; i++) {
        System.out.println("test[" + i + "] = " + Integer.toHexString(test[i]));
    }
    String projDateString2 = new String(test);
    projDateString2.replaceAll("\0x96", "\u2013");
    System.out.println("projDateString2: " + projDateString);

I am not sure I am setting up projDateString2 correctly. As you can see, the hex value of that dash is ffffff96 when I getBytes on the string using Cp1252 encoding. If I getBytes with UTF8 it comes in as 3 hex values instead of one.

This gives me the following output:

    test[0] = 30
    test[1] = 38
    test[2] = 2f
    test[3] = 32
    test[4] = 30
    test[5] = 31
    test[6] = 30
    test[7] = 20
    test[8] = ffffff96
    test[9] = 20
    test[10] = 50
    test[11] = 72
    test[12] = 65
    test[13] = 73
    test[14] = 65
    test[15] = 6e
    test[16] = 74
    projDateString2: 08/2010 ΓÇô Present

As you can see, the replace did nothing, and the println still gives me garbage characters instead of a plain-text "-".


Java strings are always in UTF-16, at least as far as the API is concerned... but you can generally just think of them as being "Unicode". The fact that they're UTF-16 is only really relevant when it comes to characters outside the Basic Multilingual Plane, i.e. with Unicode values above U+FFFF. They have to be represented as surrogate pairs in Java. But I don't think you need to worry about this in your case. So just think of the values in Strings as "Unicode text" without a specific encoding... in particular, definitely not in UTF-8 or CP1252. Those are the encodings used to convert binary data (e.g. a byte array) into text data (e.g. a string).
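To make the distinction between text and encoded bytes concrete, here is a minimal sketch. The class name and the `toHex` helper are made up for illustration, and it assumes the JVM ships the `windows-1252` charset (the canonical name for Cp1252). It encodes the same one-character string, U+2013 EN DASH, in both encodings:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Hypothetical helper: formats bytes as space-separated two-digit hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xff so negative bytes print as "96" rather than "ffffff96".
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dash = "\u2013"; // EN DASH, a single char inside the String

        System.out.println(dash.length());                                         // 1
        System.out.println(toHex(dash.getBytes(Charset.forName("windows-1252")))); // 96
        System.out.println(toHex(dash.getBytes(StandardCharsets.UTF_8)));          // e2 80 93
    }
}
```

The string itself never changes; only its byte representation differs depending on the encoding you ask for. This is also why the question's dump shows one byte for Cp1252 and three for UTF-8.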

You shouldn't be using String.getBytes() or new String(byte[]) without specifying the encoding - that's the problem. Those always use the platform default encoding - which is almost always the wrong choice.

You say you "have a string that I have read in from a Word document" - how did you read it in? How did it start off life?

If you have the bytes and you know the relevant encoding, you should use:

    String text = new String(bytes, encoding);

You should never have to deal with a string which has been created using the wrong encoding - if you get to that stage, you're almost bound to be risking information loss. Tackle the problem as early as you possibly can, rather than trying to fix the data up later on.
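As a sketch of decoding at the earliest point, the example below feeds the byte values from the question's dump into `new String(bytes, charset)`. The class and method names are illustrative, and it assumes the bytes really are Cp1252 (`windows-1252`), which the question only suspects:

```java
import java.nio.charset.Charset;

class DecodeDemo {
    // Decodes raw bytes that are assumed to be Cp1252 (windows-1252).
    static String decodeCp1252(byte[] bytes) {
        return new String(bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Byte values copied from the hex dump in the question.
        byte[] raw = {
            0x30, 0x38, 0x2f, 0x32, 0x30, 0x31, 0x30, 0x20,
            (byte) 0x96,
            0x20, 0x50, 0x72, 0x65, 0x73, 0x65, 0x6e, 0x74
        };

        String text = decodeCp1252(raw);

        // Byte 0x96 becomes the single char U+2013 at index 8.
        System.out.println(text.length());                         // 17
        System.out.println(text.indexOf('\u2013'));                // 8
        System.out.println(Integer.toHexString(text.charAt(8)));   // 2013
    }
}
```

Once the text has been decoded correctly, the original encoding no longer matters and you can work with the characters directly.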

The next thing to understand is that the String class in Java is immutable. Calling replaceAll on a string won't change the existing string. It will instead return a new string with the replacements made.

So this statement:

    projDateString2.replaceAll("\0x96", "\u2013");

will never do what you want. Even if everything else is correct, you should be using:

    projDateString2 = projDateString2.replaceAll("\0x96", "\u2013");

(or something similar). I don't think that actually will do what you want anyway, but you need to be aware of it for when everything else is sorted out.
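A small sketch of both points, with illustrative class and method names. It shows that a discarded result leaves the string untouched, and that the literal `"\0x96"` is not a single character: Java reads `\0` as an octal escape for NUL, followed by the three ordinary characters `x`, `9` and `6`.

```java
class ImmutableDemo {
    // Returns a NEW string; the argument itself is never modified.
    static String withPlainDash(String s) {
        return s.replace('\u2013', '-');
    }

    public static void main(String[] args) {
        String s = "08/2010 \u2013 Present";

        withPlainDash(s);                              // result discarded, s unchanged
        System.out.println(s.indexOf('\u2013'));       // 8 (the EN DASH is still there)

        s = withPlainDash(s);                          // keep the returned string
        System.out.println(s);                         // 08/2010 - Present

        // "\0x96" is four chars: NUL, 'x', '9', '6'.
        System.out.println("\0x96".length());          // 4
        System.out.println((int) "\0x96".charAt(0));   // 0
    }
}
```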


Conversion is generally done by something like this:

    String properlyEncoded =
        new String(original.getBytes(originalEncoding), newEncoding);

Note that it is not unlikely that some information is lost during the conversion.
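A minimal sketch of that loss. ISO-8859-1 is used here only as an example of an encoding with no en dash, and the class and method names are illustrative. The character cannot be encoded, so `getBytes` substitutes `?` and the original is gone for good:

```java
import java.nio.charset.StandardCharsets;

class LossyRoundTripDemo {
    // Encodes with one charset and decodes with another, as in the snippet above.
    static String reencode(String original) {
        byte[] bytes = original.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String before = "08/2010 \u2013 Present";
        String after = reencode(before);

        System.out.println(after);                 // 08/2010 ? Present
        System.out.println(before.equals(after));  // false
    }
}
```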


First you need to make sure that you properly convert from CP1252 bytes to Java's character representation (which is UTF-16). Since you're using a library for parsing .docx files, this has probably happened.

Now all you need to do is call projDateString.replace('\u2013', '-') and do something with the return value. No need for replaceAll(), since you're not working with regular expressions.
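For illustration, a short sketch of the difference (the class and method names are made up, and the dotted date is an invented input). `replace` treats its arguments literally, while `replaceAll` interprets the first argument as a regular expression:

```java
class ReplaceVsReplaceAllDemo {
    // Literal, character-for-character replacement of EN DASH with '-'.
    static String normalizeDash(String s) {
        return s.replace('\u2013', '-');
    }

    public static void main(String[] args) {
        System.out.println(normalizeDash("08/2010 \u2013 Present")); // 08/2010 - Present

        // In a regex, "." matches any character, so every char is replaced.
        System.out.println("08.2010".replaceAll(".", "/"));          // ///////
        // The literal overload only touches the actual dot.
        System.out.println("08.2010".replace(".", "/"));             // 08/2010
    }
}
```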
