
Convert International String to \u Codes in Java

How can I convert an international (e.g. Russian) String to \u numbers (Unicode numbers)?

e.g. \u041e\u041a for OK ?


There is a JDK tool, native2ascii, that is run from the command line as follows (it ships with JDK 8 and earlier and was removed in JDK 9):

native2ascii -encoding utf8 src.txt output.txt

Example:

src.txt

بسم الله الرحمن الرحيم

output.txt

\u0628\u0633\u0645 \u0627\u0644\u0644\u0647 \u0627\u0644\u0631\u062d\u0645\u0646 \u0627\u0644\u0631\u062d\u064a\u0645

If you want to use it in your Java application, you can wrap this command line as follows:

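// Requires: import java.io.File; the calls below can throw IOException and InterruptedException
String pathSrc = "./tmp/src.txt";
String pathOut = "./tmp/output.txt";
// Passing the arguments separately keeps paths that contain spaces intact
Process p = new ProcessBuilder("native2ascii", "-encoding", "utf8",
        new File(pathSrc).getAbsolutePath(), new File(pathOut).getAbsolutePath()).start();
p.waitFor(); // wait until native2ascii has finished writing the output file
System.out.println("THE END");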

Then read the content of the new file.


You could use escapeJava from org.apache.commons.lang.StringEscapeUtils (escapeJavaStyleString is the private helper it delegates to, so it cannot be called directly).


I also had this problem. I had some Portuguese text with some special characters, but these characters were already in Unicode escape format (e.g. \u00e3).

So I want to convert S\u00e3o to São.

I did it using Apache Commons StringEscapeUtils, as @sorin-sbarnea said. It can be downloaded here.

Use the method unescapeJava, like this:

String text = "S\\u00e3o"; // doubled backslash keeps the escape sequence literal, as if read from a file
text = StringEscapeUtils.unescapeJava(text);
System.out.println("text " + text);

(There is also the method escapeJava, which does the reverse: it replaces the non-ASCII characters in the string with \u escapes.)

If anyone knows a pure-Java solution, please tell us.
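One possible pure-Java approach to the unescaping asked for above is a regular expression over the escape sequences. The following is only a minimal sketch: the class and method names are made up for illustration, and it assumes every backslash-u sequence in the input is a real escape (it does not treat a doubled backslash before the u as a literal backslash, which a full Java-style unescaper would).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class UnicodeUnescapeSketch {
    // A literal backslash, one or more 'u', then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Parse the four hex digits into a single UTF-16 code unit
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement stops '$' and '\' from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash puts a literal backslash-u sequence into the string
        System.out.println(unescape("S\\u00e3o"));
    }
}
```

With the input above, unescape should return São. Surrogate pairs written as two consecutive escapes are decoded one code unit at a time, which reassembles the original character.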


Here's an improved version of ArtB's answer:

    StringBuilder b = new StringBuilder();

    for (char c : input.toCharArray()) {
        if (c >= 128)
            b.append("\\u").append(String.format("%04X", (int) c));
        else
            b.append(c);
    }

    return b.toString();

This version escapes all non-ASCII chars and works correctly for low Unicode code points like Ä.
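For reference, the snippet above can be wrapped into a self-contained class roughly like this. The class and method names are hypothetical, and the input in main is itself written with escapes only so that the source file stays plain ASCII.

```java
// Hypothetical wrapper around the loop shown above.
final class UnicodeEscapeSketch {
    static String escapeNonAscii(String input) {
        StringBuilder b = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (c >= 128) {
                // %04X zero-pads to the four hex digits that an escape requires
                b.append("\\u").append(String.format("%04X", (int) c));
            } else {
                b.append(c);
            }
        }
        return b.toString();
    }

    public static void main(String[] args) {
        // Cyrillic capital O and capital Ka
        System.out.println(escapeNonAscii("\u041e\u041a"));
        // Latin capital A with diaeresis
        System.out.println(escapeNonAscii("\u00c4"));
    }
}
```

The two calls in main should print \u041E\u041A and \u00C4. Characters outside the Basic Multilingual Plane are stored as two chars (a surrogate pair), so they come out as two consecutive escapes, which is also how Java source represents them.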


There are three parts to the answer:

  1. Get the Unicode for each character
  2. Determine if it is in the Cyrillic Page
  3. Convert to Hexadecimal.

To get each character you can iterate through the String using the charAt() or toCharArray() methods.

for( char c : s.toCharArray() )

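The value of each char is its UTF-16 code unit, which is the same as the Unicode code point for every character in the Basic Multilingual Plane (this includes all the Cyrillic blocks listed below).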

The Cyrillic Unicode characters are any character in the following ranges:

Cyrillic:            U+0400–U+04FF ( 1024 -  1279)
Cyrillic Supplement: U+0500–U+052F ( 1280 -  1327)
Cyrillic Extended-A: U+2DE0–U+2DFF (11744 - 11775)
Cyrillic Extended-B: U+A640–U+A69F (42560 - 42655)

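If a character falls in one of these ranges, it is Cyrillic, so a simple if check is enough. For such characters, format the value as four zero-padded hex digits and prepend "\\u". (Integer.toHexString() alone is not enough: it does not pad, so U+0400 would come out as the invalid escape \u400.) Put together, it should look something like this:

final int[][] ranges = new int[][]{
    {  1024,  1279 }, // Cyrillic
    {  1280,  1327 }, // Cyrillic Supplement
    { 11744, 11775 }, // Cyrillic Extended-A
    { 42560, 42655 }, // Cyrillic Extended-B
};
StringBuilder b = new StringBuilder();

for( char c : s.toCharArray() ){
    int[] insideRange = null;
    for( int[] range : ranges ){
        if( range[0] <= c && c <= range[1] ){
            insideRange = range;
            break;
        }
    }

    if( insideRange != null ){
        // zero-pad to four hex digits so the escape is always valid
        b.append( "\\u" ).append( String.format( "%04x", (int) c ) );
    }else{
        b.append( c );
    }
}

return b.toString();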

Edit: probably should make the check c < 128 and reverse the if and the else bodies; you probably should escape everything that isn't ASCII. I was probably too literal in my reading of your question.
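As an alternative to the hand-written range table in this answer, the JDK (Java 7 and later) can classify a character through Character.UnicodeScript. The sketch below assumes that "Cyrillic" means the Unicode script property, which is close to but not identical with the four block ranges listed above (for example, some combining marks inside those blocks are reported as INHERITED rather than CYRILLIC). The class and method names are hypothetical.

```java
// Hypothetical helper, shown only to illustrate Character.UnicodeScript.
final class CyrillicEscapeSketch {
    static boolean isCyrillic(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.CYRILLIC;
    }

    static String escapeCyrillic(String s) {
        StringBuilder b = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (isCyrillic(c)) {
                // %04x zero-pads to four hex digits
                b.append(String.format("\\u%04x", (int) c));
            } else {
                b.append(c);
            }
        }
        return b.toString();
    }

    public static void main(String[] args) {
        // Cyrillic capital O and Ka, followed by Latin text that is left alone
        System.out.println(escapeCyrillic("\u041e\u041a ok"));
    }
}
```

For the input in main this should print \u041e\u041a ok: the two Cyrillic letters are escaped and the Latin letters pass through unchanged.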


There's a command-line tool that ships with Java called native2ascii. It converts Unicode files to ASCII-escaped files. I've found this to be a necessary step when generating .properties files for localization.


If you need this in order to write a .properties file, you can just add the Strings to a Properties object and then save it to a file. It will take care of the conversion.
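A minimal sketch of that approach follows. The class name, method name, and the "greeting" key are made up for illustration, and the output goes to an in-memory stream instead of a file so the result can be inspected. Note that the escaping happens in the store(OutputStream, String) overload; store(Writer, String) writes the characters unescaped.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, shown only to illustrate Properties.store.
final class PropertiesEscapeSketch {
    static String storeToString(String key, String value) {
        Properties props = new Properties();
        props.setProperty(key, value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // Writes ISO 8859-1 and escapes characters outside printable ASCII
            props.store(out, null);
            return out.toString("ISO-8859-1");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Value is Cyrillic capital O and Ka, written with escapes to keep the source ASCII
        System.out.print(storeToString("greeting", "\u041e\u041a"));
    }
}
```

The output starts with a timestamp comment line that Properties adds on its own, followed by the greeting entry with its value written as two \u escapes.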


Apache Commons StringEscapeUtils.escapeEcmaScript(String) returns a string with Unicode characters escaped using the \u notation.
