Resolving a URL with Java gives me wrongly encoded characters in the URL
When I'm doing the following:
try {
    URL url = new URL(urlAsString);
    // using proxy may increase latency
    HttpURLConnection hConn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
    // force no follow
    hConn.setInstanceFollowRedirects(false);
    // the program doesn't care what the content actually is
    hConn.setRequestMethod("HEAD");
    // default is 0 => infinity waiting
    hConn.setConnectTimeout(timeout);
    hConn.setReadTimeout(timeout);
    hConn.connect();
    int responseCode = hConn.getResponseCode();
    hConn.getInputStream().close();
    if (responseCode == HttpURLConnection.HTTP_OK)
        return urlAsString;
    String loc = hConn.getHeaderField("Location");
    if (responseCode == HttpURLConnection.HTTP_MOVED_PERM && loc != null)
        return loc.replaceAll(" ", "+");
} catch (Exception ex) {
}
return "";
for that url: http://bit.ly/gek1qK I'm getting
http://blog.tweetsmarter.com/twitter-downtime/twitter-redesignsâthen-everything-breaks/
which is wrong. Firefox resolves to
http://blog.tweetsmarter.com/twitter-downtime/twitter-redesigns%E2%80%94then-everything-breaks/
What is wrong in the code?
As per RFC 2616, section 2.2, HTTP header values should normally be encoded using ISO-8859-1.
Here, bit.ly is sending a bad response - the Location: header is encoded using UTF-8, so the em-dash character is represented by three separate bytes (0xe2, 0x80, 0x94).
HttpURLConnection decodes the bytes using ISO-8859-1 so they become three characters (â and two non-printable control characters, U+0080 and U+0094), but it looks as if you re-encode them using UTF-8 (producing 2 bytes per character, since all three have values >= 0x80) before applying URL-encoding.
Firefox most likely treats the data as ISO-8859-1 throughout; the problem then cancels itself out when URL-encoding is applied later on.
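To make the byte-level reasoning above concrete, here is a minimal sketch that starts from the three UTF-8 bytes of the em-dash, decodes them as ISO-8859-1 the way the answer says HttpURLConnection does, and then percent-encodes the result with each charset. The class and helper names (LocationHeaderDemo, decodeAsLatin1, percentEncode) are made up for this illustration; only URLEncoder and StandardCharsets are real JDK APIs.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class LocationHeaderDemo {

    // Hypothetical helper: interpret raw header bytes as ISO-8859-1,
    // so every byte becomes exactly one char with the same numeric value.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: percent-encode with the named charset,
    // wrapping the checked exception for brevity.
    static String percentEncode(String s, String charsetName) {
        try {
            return URLEncoder.encode(s, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // UTF-8 encoding of the em-dash (U+2014), as sent in the Location header
        byte[] emDashUtf8 = {(byte) 0xE2, (byte) 0x80, (byte) 0x94};

        String decoded = decodeAsLatin1(emDashUtf8);
        System.out.println(decoded.length());                     // 3
        // Re-encoding as UTF-8 doubles every byte: the wrong result
        System.out.println(percentEncode(decoded, "UTF-8"));      // %C3%A2%C2%80%C2%94
        // Re-encoding as ISO-8859-1 restores the original bytes
        System.out.println(percentEncode(decoded, "ISO-8859-1")); // %E2%80%94
    }
}
```

The last line is the sequence Firefox shows, which is consistent with the explanation that the mis-decoding cancels out when the same charset is used in both directions.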
You could do the same by URL-encoding the value returned by getHeaderField(); since the Unicode range U+0080 to U+00FF is identical to the ISO-8859-1 byte range 0x80-0xFF, the non-ASCII characters can be encoded by casting them to int values:
/**
 * Takes a URI that was decoded as ISO-8859-1 and applies percent-encoding
 * to non-ASCII characters. Workaround for broken origin servers that send
 * UTF-8 in the Location: header.
 */
static String encodeUriFromHeader(String uri) {
    StringBuilder sb = new StringBuilder();
    for (char ch : uri.toCharArray()) {
        if (ch < (char) 128) {
            // plain ASCII passes through unchanged
            sb.append(ch);
        } else {
            // this is ONLY valid if the uri was decoded using ISO-8859-1
            sb.append(String.format("%%%02X", (int) ch));
        }
    }
    return sb.toString();
}
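As a quick usage check, the sketch below applies the same per-character logic to the Location value from the question, written the way it would look after ISO-8859-1 decoding (â followed by U+0080 and U+0094). The wrapper class EncodeUriFromHeaderUsage and the hard-coded input string are assumptions for the demonstration; in the real code the input would come from hConn.getHeaderField("Location").

```java
final class EncodeUriFromHeaderUsage {

    // Same approach as the workaround above: percent-encode every
    // char >= 128 by its numeric value, leave ASCII untouched.
    static String encodeUriFromHeader(String uri) {
        StringBuilder sb = new StringBuilder();
        for (char ch : uri.toCharArray()) {
            if (ch < 128) {
                sb.append(ch);
            } else {
                // only valid if the header was decoded as ISO-8859-1
                sb.append(String.format("%%%02X", (int) ch));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed input: the header value as HttpURLConnection would hand it back
        String fromHeader = "http://blog.tweetsmarter.com/twitter-downtime/"
                + "twitter-redesigns\u00E2\u0080\u0094then-everything-breaks/";
        // prints http://blog.tweetsmarter.com/twitter-downtime/twitter-redesigns%E2%80%94then-everything-breaks/
        System.out.println(encodeUriFromHeader(fromHeader));
    }
}
```

Unlike URLEncoder, this leaves the scheme, slashes and other ASCII characters alone, so the result is still a usable URL.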
There is nothing wrong. The difference is that the em-dash is represented differently in different encodings. So if Firefox uses an encoding other than the one your program uses, you will see a different character.
Both are correct in your case; it is just a matter of encoding. In Java, you use UTF-8, which is the World Wide Web Consortium recommendation, while it seems that what you see in Firefox is ISO-8859-1.
If you want to generate same result as Firefox in Java, try this:
System.out.print(URLEncoder.encode(loc.replace(" ", "+"), "ISO-8859-1"));
It will print what you see in Firefox. (Obviously, it will encode / and : as well, but this is just to demonstrate.)
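The side effect mentioned in the parenthesis can be seen in a small sketch. The class name WholeUrlEncodingDemo, the helper encodeLatin1 and the example.com URL are placeholders chosen for the illustration, not part of the original code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class WholeUrlEncodingDemo {

    // Hypothetical helper: URLEncoder with ISO-8859-1, checked exception wrapped.
    static String encodeLatin1(String s) {
        try {
            return URLEncoder.encode(s, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A Location value as decoded from ISO-8859-1 (placeholder host and path)
        String loc = "http://example.com/a\u00E2\u0080\u0094b";
        // The em-dash bytes come out right, but ':' and '/' are escaped too:
        // prints http%3A%2F%2Fexample.com%2Fa%E2%80%94b
        System.out.println(encodeLatin1(loc));
    }
}
```

Because URLEncoder is designed for form values rather than whole URLs, the output is useful for comparing the encoded bytes but would not work as a redirect target without further processing.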