
How to get encoded version of string (e.g. \u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f)

How to get encoded version of string (e.g. \u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f) using Java?

EDIT: I guess the question is not very clear... Basically what I want is this:

Given a string s="blalbla", I want to get the string "\uXXXX\uYYYY..."


You will need to extract each code unit from the String and encode it yourself. The following works for all Strings, even when a single visible character is made up of several code units (combining sequences or supplementary characters), because every UTF-16 code unit is escaped individually.

public String getUnicodeEscapes(String aString)
{
    if (aString != null && aString.length() > 0)
    {
        int length = aString.length();
        StringBuilder buffer = new StringBuilder(length);
        for (int ctr = 0; ctr < length; ctr++)
        {
            char codeUnit = aString.charAt(ctr);
            String hexString = Integer.toHexString(codeUnit);
            String padAmount = "0000".substring(hexString.length());
            buffer.append("\\u");
            buffer.append(padAmount);
            buffer.append(hexString);
        }
        return buffer.toString();
    }
    else
    {
        return null;
    }
}

The above produces output as dictated by the Java Language Specification on Unicode escapes, i.e. it produces output of the form \uxxxx for each UTF-16 code unit. It addresses supplementary characters by producing a pair of code units represented as \uxxxx\uyyyy.
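To make the surrogate-pair behavior concrete, here is a minimal, self-contained sketch. The class name EscapeDemo and its static escape method are hypothetical, and it uses String.format for the zero padding instead of the manual substring approach above; the result should be the same lowercase, four-digit form per code unit.

```java
public class EscapeDemo {
    // Hypothetical helper: escapes every UTF-16 code unit as a
    // backslash, the letter u, and four lowercase hex digits.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 6);
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The Russian word from the question, written with escapes so the
        // source file does not depend on the compiler's encoding setting.
        String russian = "\u0421\u043b\u0443\u0436\u0435\u0431\u043d\u0430\u044f";
        System.out.println(escape(russian));

        // U+1F600 is a supplementary character, stored as two code units.
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(emoji.length());   // prints 2
        System.out.println(escape(emoji));    // two escapes: d83d then de00
    }
}
```

The first println reproduces the escaped string from the question; the emoji comes out as two escapes, one per surrogate.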

A variant of the code above produces Unicode code points in the U+XXXX format instead (five or six hex digits for supplementary characters):

public String getUnicodeCodepoints(String aString)
{
    if (aString != null && aString.length() > 0)
    {
        int length = aString.length();
        StringBuilder buffer = new StringBuilder(length);
        for (int ctr = 0; ctr < length; ctr++)
        {
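            char ch = aString.charAt(ctr);
            // Skip the second half of a surrogate pair; codePointAt()
            // already consumed it on the previous iteration.
            if (Character.isLowSurrogate(ch) && ctr > 0
                    && Character.isHighSurrogate(aString.charAt(ctr - 1)))
            {
                continue;
            }
            else
            {
                int codePoint = aString.codePointAt(ctr);
                String hexString = Integer.toHexString(codePoint);
                // Pad to at least four hex digits. Supplementary code points
                // have five or six digits, so they need no padding.
                String padAmount = hexString.length() < 4
                        ? "0000".substring(hexString.length()) : "";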
                buffer.append(" U+");
                buffer.append(padAmount);
                buffer.append(hexString);
            }
        }
        return buffer.toString();
    }
    else
    {
        return null;
    }
}

The gruntwork is done by the String.codePointAt() method, which returns the Unicode code point starting at a given index. For a String composed of combining characters, String.length() is not the number of visible characters but the number of UTF-16 code units. For example, क (U+0915) and ् (U+094D) combine to form क् in Devanagari, and the above function will rightfully return U+0915 U+094d without any fuss, as String.length() returns 2 for the combined character. A supplementary character, which occupies two code units, is reported as a single code point.
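As an alternative sketch, assuming Java 8 or later, String.codePoints() walks the code points directly, so no surrogate bookkeeping is needed. The class name CodePointDemo and the method name describe are hypothetical, and this version prints uppercase hex and separates entries with a single space, which differs slightly from the function above.

```java
import java.util.stream.Collectors;

public class CodePointDemo {
    // Hypothetical helper: formats each code point as U+ followed by
    // at least four uppercase hex digits, separated by spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Devanagari KA followed by the combining VIRAMA sign.
        String ka = "\u0915\u094d";
        System.out.println(ka.length());      // prints 2
        System.out.println(describe(ka));     // prints U+0915 U+094D

        // A supplementary character: two code units, one code point.
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(emoji.length());   // prints 2
        System.out.println(describe(emoji));  // prints U+1F600
    }
}
```

An unpaired surrogate is passed through by codePoints() as its own value, so it still shows up in the output rather than being dropped.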
