Resolving a URL with Java gives me the wrong encoded chars in the URL

2023-02-23 00:15

When I'm doing the following:

try {
    URL url = new URL(urlAsString);
    //using proxy may increase latency
    HttpURLConnection hConn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
    // force no follow
    hConn.setInstanceFollowRedirects(false);
    // the program doesn't care what the content actually is       
    hConn.setRequestMethod("HEAD");
    // default is 0 => infinity waiting
    hConn.setConnectTimeout(timeout);
    hConn.setReadTimeout(timeout);
    hConn.connect();
    int responseCode = hConn.getResponseCode();
    hConn.getInputStream().close();
    if (responseCode == HttpURLConnection.HTTP_OK)
        return urlAsString;

    String loc = hConn.getHeaderField("Location");
    if (responseCode == HttpURLConnection.HTTP_MOVED_PERM && loc != null)
        return loc.replaceAll(" ", "+");

} catch (Exception ex) {
}
return "";

For the URL http://bit.ly/gek1qK I'm getting

http://blog.tweetsmarter.com/twitter-downtime/twitter-redesignsâthen-everything-breaks/

which is wrong. Firefox resolves to

http://blog.tweetsmarter.com/twitter-downtime/twitter-redesigns%E2%80%94then-everything-breaks/

What is wrong in the code?

As per RFC 2616, section 2.2, HTTP header values should normally be encoded using ISO-8859-1.

Here, bit.ly is sending a bad response - the Location: header is encoded using UTF-8, so the em-dash character is represented by three separate bytes (0xe2, 0x80, 0x94).

HttpURLConnection decodes the bytes using ISO-8859-1, so they become three characters (â followed by two unprintable control characters, U+0080 and U+0094).
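To make the byte-level explanation concrete, here is a minimal sketch. The class and method names are made up for illustration, and it only simulates the decoding step with a plain byte array rather than calling HttpURLConnection:

```java
import java.nio.charset.StandardCharsets;

class HeaderDecodingDemo {
    // Decodes raw header bytes one byte per character, as ISO-8859-1 does.
    static String decodeAsLatin1(byte[] rawBytes) {
        return new String(rawBytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The em dash (U+2014) as the server sent it: UTF-8 encoded.
        byte[] emDashUtf8 = "\u2014".getBytes(StandardCharsets.UTF_8);
        for (byte b : emDashUtf8) {
            System.out.printf("%02X ", b & 0xFF);   // prints E2 80 94
        }
        System.out.println();

        // The same bytes read as ISO-8859-1 turn into three separate chars.
        String misdecoded = decodeAsLatin1(emDashUtf8);
        for (char ch : misdecoded.toCharArray()) {
            System.out.printf("U+%04X ", (int) ch); // prints U+00E2 U+0080 U+0094
        }
        System.out.println();
    }
}
```

U+00E2 is the visible â; the other two are control characters, which is why only the â shows up in the printed URL.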

Firefox most likely treats the data as ISO-8859-1 throughout; the problem then cancels itself out when URL-encoding is applied later on.

You could do the same by URL-encoding the value returned by getHeaderField(); since the Unicode range U+0080 to U+00FF is identical to the ISO-8859-1 byte range 0x80-0xFF, the non-ASCII characters can be encoded by casting them to int values:

/**
 * Takes a URI that was decoded as ISO-8859-1 and applies percent-encoding
 * to non-ASCII characters. Workaround for broken origin servers that send
 * UTF-8 in the Location: header.
 */
static String encodeUriFromHeader(String uri) {
    StringBuilder sb = new StringBuilder();

    for(char ch : uri.toCharArray()) {
        if(ch < (char)128) {
            sb.append(ch);
        } else {
            // this is ONLY valid if the uri was decoded using ISO-8859-1
            sb.append(String.format("%%%02X", (int)ch));
        }
    }

    return sb.toString();
}
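If you need the real Unicode text rather than a percent-encoded URI (for display or logging, say), a related trick is to undo the ISO-8859-1 decoding and decode the bytes again as UTF-8. This is only a sketch under the same assumption as above, namely that the header value was decoded as ISO-8859-1 and the server really sent UTF-8. The helper name reinterpretAsUtf8 is hypothetical:

```java
import java.nio.charset.StandardCharsets;

class LocationHeaderRepair {
    // Recovers the raw bytes (one per char) and decodes them as UTF-8.
    // Only meaningful if every char is <= U+00FF; anything above that
    // would be replaced by '?' when converting back to ISO-8859-1.
    static String reinterpretAsUtf8(String latin1Decoded) {
        byte[] raw = latin1Decoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Path segment as it would look after the ISO-8859-1 decoding.
        String fromHeader = "twitter-redesigns\u00E2\u0080\u0094then-everything-breaks";
        String repaired = reinterpretAsUtf8(fromHeader);
        // The three mis-decoded chars collapse back into one em dash.
        System.out.println(repaired.equals("twitter-redesigns\u2014then-everything-breaks"));
    }
}
```

If the server had sent ISO-8859-1 as the RFC expects, this step would corrupt the value instead, so apply it only to origins known to misbehave.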

There is nothing wrong. The difference is that the em dash is represented differently in different encodings, so if Firefox uses an encoding other than the one your program uses, you will see a different character.

Both are correct in your case; it's just a matter of encoding. In Java you use UTF-8, which is the World Wide Web Consortium recommendation, while what you see in Firefox appears to be ISO-8859-1.

If you want to generate the same result as Firefox in Java, try this:

System.out.print(URLEncoder.encode(loc.replace(" ", "+"), "ISO-8859-1"));

It will print what you see in Firefox. (Obviously, it will encode / and : as well; this is just to demonstrate.)
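A small sketch of what that one-liner does to a whole URL, assuming the Location value was decoded as ISO-8859-1 (so the em dash arrives as the three chars U+00E2, U+0080, U+0094). The class and method names are made up for the example:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UrlEncoderDemo {
    // Runs URLEncoder over the entire string using ISO-8859-1.
    static String encodeWhole(String s) {
        try {
            return URLEncoder.encode(s, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            // ISO-8859-1 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String loc = "http://blog.tweetsmarter.com/twitter-downtime/"
                + "twitter-redesigns\u00E2\u0080\u0094then-everything-breaks/";
        // The dash bytes come out as %E2%80%94, but ':' and '/' are
        // escaped too (%3A and %2F), so the result is not a usable URL.
        System.out.println(encodeWhole(loc));
    }
}
```

URLEncoder is meant for form values, not whole URLs, which is why escaping only the non-ASCII characters is the safer route if the result has to be fetched again.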
