Error resolving wikipedia url with unicode character with Java URL
I'm having trouble getting wikipedia urls including unicode!
Given a page title like: 1992\u201393_UE_Lleida_seasonnow
Just plain url ... http://en.wikipedia.org/wiki/1992\u201393_UE_Lleida_seasonnow
Using URLEncoder (set to UTF-8) .... http://en.wikipedia.org/wiki/1992%5Cu201393_UE_Lleida_seasonnow
When I try to resolve either URL, I get nothing. If I copy the URLs into my browser, I get nothing either; it's only when I paste in the actual Unicode character that I get the page.
Does wikipedia have some strange way to encode unicode in urls? Or am I just doing something dumb?
Here's the code I'm using:
URL url = new URL("http://en.wikipedia.org/wiki/"+x);
System.out.println("trying "+url);
// Attempt to open the wiki page
InputStream is;
try{ is = url.openStream();
} catch(Exception e){ return null; }
The correct URI is http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season.
Many browsers display literals instead of percent-encoded escape sequences. This is considered to be more user-friendly. However, correctly encoded URIs must use percent encoding for characters not permitted in the path part:
path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
; non-zero-length segment without any colon ":"
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
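To make the grammar above concrete, here is a minimal sketch of percent-encoding a single raw path segment by hand. The class and method names (`PathSegmentEncoder`, `encodeSegment`) are made up for illustration; the idea is simply that any UTF-8 byte that is not one of the literal `pchar` characters gets written as `%XX`. It assumes the input is a raw, not yet encoded, segment, so a literal `%` is itself escaped.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK.
class PathSegmentEncoder {
    // Characters a path segment may contain literally (RFC 3986 pchar
    // without pct-encoded): unreserved + sub-delims + ":" + "@".
    private static final String ALLOWED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
            + "-._~" + "!$&'()*+,;=" + ":@";

    static String encodeSegment(String segment) {
        StringBuilder out = new StringBuilder();
        // Work on the UTF-8 bytes so multi-byte characters become several %XX groups.
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80 && ALLOWED.indexOf((char) v) >= 0) {
                out.append((char) v);
            } else {
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The en-dash U+2013 is the three UTF-8 bytes E2 80 93.
        System.out.println(encodeSegment("2009\u201310_UE_Lleida_season"));
        // prints 2009%E2%80%9310_UE_Lleida_season
    }
}
```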
The URI class can help you with such sequences:
- Characters in the other category are permitted wherever RFC 2396 permits escaped octets, that is, in the user-information, path, query, and fragment components, as well as in the authority component if the authority is registry-based. This allows URIs to contain Unicode characters beyond those in the US-ASCII character set.
String literal = "http://en.wikipedia.org/wiki/1992\u201393_UE_Lleida_seasonnow";
URI uri = new URI(literal);
System.out.println(uri.toASCIIString());
You can read more about URI encoding here.
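Tying this back to the code in the question, a rough sketch might look like the following. `WikiPageFetcher`, `asciiUrl` and `open` are hypothetical names, and it assumes the title already contains the real Unicode character (not the six characters backslash, `u`, `2`, `0`, `1`, `3`) and has no spaces, since `URI.create` rejects those.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

// Hypothetical helper built around java.net.URI.
class WikiPageFetcher {
    // Builds the ASCII-only form of the page URL for a given title.
    static String asciiUrl(String title) {
        return URI.create("http://en.wikipedia.org/wiki/" + title).toASCIIString();
    }

    // Opens the page; callers should close the stream (e.g. try-with-resources).
    static InputStream open(String title) throws IOException {
        return URI.create(asciiUrl(title)).toURL().openStream();
    }

    public static void main(String[] args) {
        System.out.println(asciiUrl("2009\u201310_UE_Lleida_season"));
        // prints http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season
    }
}
```

Letting the `IOException` propagate instead of returning null also makes it easier to see why a request failed.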
Does wikipedia have some strange way to encode unicode in urls?
It's not really strange, it's standard use of IRIs. The IRI:
http://en.wikipedia.org/wiki/2009–10_UE_Lleida_season
which includes a Unicode en-dash, is equivalent to the URI:
http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season
You can include the IRI form in links and it will work in modern browsers. But many network libraries—including Java's, along with older browsers—require ASCII-only URIs. (Modern browsers will still show the pretty IRI version in the address bar, even if you linked to it with the encoded URI version.)
To convert an IRI to a URI in general, you use the IDN algorithm on the hostname, and URL-encode any other non-ASCII characters as UTF-8 bytes. In your case, it should be:
String urlencoded= URLEncoder.encode(x, "utf-8").replace("+", "%20");
URL url= new URL("http://en.wikipedia.org/wiki/"+urlencoded);
Note: replacing + with %20 is necessary to make values of x with spaces in them work. URLEncoder does application/x-www-form-urlencoded encoding, as used in query strings. But in a URL path segment like this, the +-means-space rule does not apply. Spaces in paths must be encoded with normal URL-encoding, as %20.
Then again, in the specific case of Wikipedia, spaces are replaced with underscores for readability, so you'd probably be better off replacing "+" with "_" directly. The %20 version will still work because Wikipedia redirects from there to the underscore version.
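For the general IRI-to-URI conversion described above (IDN on the hostname, percent-encoded UTF-8 for everything else), something like this sketch may be enough. The names `IriToUri`, `toAsciiUri` and `wikipediaUrl` are invented for the example. It relies on `java.net.IDN.toASCII` for the host and on the multi-argument `URI` constructor, which quotes illegal characters such as spaces in the path. Note that this constructor always escapes a literal `%`, so it should only be fed raw, unencoded paths.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not a definitive IRI implementation.
class IriToUri {
    static String toAsciiUri(String scheme, String host, String path) {
        try {
            // Punycode-encode a non-ASCII hostname.
            String asciiHost = IDN.toASCII(host);
            // The 4-argument constructor quotes illegal path characters;
            // toASCIIString() then percent-encodes the remaining non-ASCII ones as UTF-8.
            return new URI(scheme, asciiHost, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Wikipedia-specific variant: spaces become underscores before encoding.
    static String wikipediaUrl(String title) {
        return toAsciiUri("http", "en.wikipedia.org", "/wiki/" + title.replace(' ', '_'));
    }

    public static void main(String[] args) {
        System.out.println(wikipediaUrl("2009\u201310 UE Lleida season"));
        // prints http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season
    }
}
```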
Here's a simple algorithm for encoding URLs that use Unicode so that you can use HttpURLConnection to retrieve them:
import static org.junit.Assert.*;

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

import org.apache.commons.lang.CharUtils;
import org.junit.Test;

public class InternationalURLEncoderTest {

    static String encodeUrl(String urlToEncode) throws UnsupportedEncodingException {
        // Split before and after every "/" so the slashes stay separate, unencoded tokens.
        String[] pathSegments = urlToEncode.split("((?<=/)|(?=/))");
        StringBuilder encodedUrlBuilder = new StringBuilder();
        for (String pathSegment : pathSegments) {
            // Only segments containing a non-ASCII character are encoded.
            boolean needsEncoding = false;
            for (char ch : pathSegment.toCharArray()) {
                if (!CharUtils.isAscii(ch)) {
                    needsEncoding = true;
                    break;
                }
            }
            // Pass the charset explicitly: the one-argument encode() is deprecated
            // and uses the platform default encoding.
            String encodedSegment = needsEncoding
                    ? URLEncoder.encode(pathSegment, "UTF-8") : pathSegment;
            encodedUrlBuilder.append(encodedSegment);
        }
        return encodedUrlBuilder.toString();
    }

    @Test
    public void test() throws UnsupportedEncodingException {
        assertEquals(
                "http://www.chinatimes.com/realtimenews/%E5%8D%97%E6%8A%95%E4%B8%80%E8%90%AC%E5%A4%9A%E6%88%B6%E5%A4%A7%E5%81%9C%E9%9B%BB-%E4%B9%9D%E6%88%90%E4%BB%A5%E4%B8%8A%E6%81%A2%E5%BE%A9%E4%BE%9B%E9%9B%BB-20130603003259-260401",
                encodeUrl("http://www.chinatimes.com/realtimenews/南投一萬多戶大停電-九成以上恢復供電-20130603003259-260401"));
        assertEquals("http://www.ttv.com.tw/",
                encodeUrl("http://www.ttv.com.tw/"));
        assertEquals("http://www.ttv.com.tw",
                encodeUrl("http://www.ttv.com.tw"));
        assertEquals("http://www.rt-drive.com.tw/shopping/?st=16",
                encodeUrl("http://www.rt-drive.com.tw/shopping/?st=16"));
    }
}
The algorithm was written using these answers on string splitting and detecting Unicode characters.
Here's a simpler way of encoding the URL in Chi's answer:
static String encodeUrl(String urlToEncode) throws URISyntaxException {
return new URI(urlToEncode).toASCIIString();
}
See this answer for clarification.
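One caveat with the one-liner: `new URI(String)` throws a `URISyntaxException` if the input contains characters such as spaces. A possible workaround, sketched below under the assumption that the input is a raw URL with no existing percent-escapes, is to let `java.net.URL` split the string and then rebuild it with the multi-argument `URI` constructor, which quotes illegal characters. `LenientUrlEncoder` is a hypothetical name, and the `URL(String)` constructor it uses is deprecated in recent JDKs (20 and later), although it still works.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper; double-encodes input that already contains %XX escapes.
class LenientUrlEncoder {
    static String encodeUrl(String urlToEncode) {
        try {
            // URL only splits the string into components; it does not validate them.
            URL url = new URL(urlToEncode);
            // The 7-argument constructor quotes illegal characters in each component.
            URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(),
                    url.getPort(), url.getPath(), url.getQuery(), url.getRef());
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot encode: " + urlToEncode, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeUrl("http://en.wikipedia.org/wiki/2009\u201310 UE Lleida season"));
        // prints http://en.wikipedia.org/wiki/2009%E2%80%9310%20UE%20Lleida%20season
    }
}
```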