Error resolving wikipedia url with unicode character with Java URL

2023-03-10 21:30

I'm having trouble getting wikipedia urls including unicode!

Given a page title like: 1992\u201393_UE_Lleida_seasonnow

Just plain url ... http://en.wikipedia.org/wiki/1992\u201393_UE_Lleida_seasonnow

Using URLEncoder (set to UTF-8) .... http://en.wikipedia.org/wiki/1992%5Cu201393_UE_Lleida_seasonnow

When I try to resolve either url, I get nothing. If I copy the urls into my browser, I get nothing too; it's only if I actually copy the unicode character in that I get the page.

Does wikipedia have some strange way to encode unicode in urls? Or am I just doing something dumb?

Here's the code I'm using:

URL url = new URL("http://en.wikipedia.org/wiki/" + x);
System.out.println("trying " + url);

// Attempt to open the wiki page
InputStream is;
try {
    is = url.openStream();
} catch (Exception e) {
    return null;
}

The correct URI is http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season.

Many browsers display literals instead of percent-encoded escape sequences. This is considered to be more user-friendly. However, correctly encoded URIs must use percent encoding for characters not permitted in the path part:

   path          = path-abempty    ; begins with "/" or is empty
                 / path-absolute   ; begins with "/" but not "//"
                 / path-noscheme   ; begins with a non-colon segment
                 / path-rootless   ; begins with a segment
                 / path-empty      ; zero characters
   path-abempty  = *( "/" segment )
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-noscheme = segment-nz-nc *( "/" segment )
   path-rootless = segment-nz *( "/" segment )
   path-empty    = 0<pchar>
   segment       = *pchar
   segment-nz    = 1*pchar
   segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                 ; non-zero-length segment without any colon ":"
   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
   pct-encoded   = "%" HEXDIG HEXDIG
   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

The URI class can help you with such sequences. From the java.net.URI documentation:

Characters in the other category are permitted wherever RFC 2396 permits escaped octets, that is, in the user-information, path, query, and fragment components, as well as in the authority component if the authority is registry-based. This allows URIs to contain Unicode characters beyond those in the US-ASCII character set.

String literal = "http://en.wikipedia.org/wiki/1992\u201393_UE_Lleida_seasonnow";
URI uri = new URI(literal);
System.out.println(uri.toASCIIString());
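
In the question, URLEncoder produced %5Cu2013, which suggests the title held a literal backslash followed by u2013 at runtime rather than the en-dash character itself. A \u2013 escape is only translated by the compiler when it appears in Java source, as in the literal above. If that is what is happening, the escape has to be decoded before the URI is built. A minimal sketch, assuming the titles only contain four-digit escapes of this form; the class and method names are hypothetical:

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WikiTitleUnescaper {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces each textual escape with the UTF-16 code unit it names.
    static String unescape(String raw) {
        Matcher m = ESCAPE.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Builds the ASCII-only form of the article URL for a raw title.
    static String toAsciiUrl(String rawTitle) {
        return URI.create("http://en.wikipedia.org/wiki/" + unescape(rawTitle)).toASCIIString();
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape as plain text, as it would be if read from a file.
        System.out.println(toAsciiUrl("1992\\u201393_UE_Lleida_seasonnow"));
    }
}
```

URI.create is used here only to avoid the checked URISyntaxException; it throws IllegalArgumentException for input that cannot be parsed.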

You can read more about URI encoding here.

Does wikipedia have some strange way to encode unicode in urls?

It's not really strange, it's standard use of IRIs. The IRI:

http://en.wikipedia.org/wiki/2009–10_UE_Lleida_season

which includes a Unicode en-dash, is equivalent to the URI:

http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season

You can include the IRI form in links and it will work in modern browsers. But many network libraries, including Java's, and older browsers require ASCII-only URIs. (Modern browsers will still show the pretty IRI version in the address bar, even if you linked to it with the encoded URI version.)

To convert an IRI to a URI in general, you use the IDN algorithm on the hostname, and URL-encode any other non-ASCII characters as UTF-8 bytes. In your case, it should be:

String urlencoded= URLEncoder.encode(x, "utf-8").replace("+", "%20");
URL url= new URL("http://en.wikipedia.org/wiki/"+urlencoded);

Note: replacing + with %20 is needed so that values of x containing spaces work. URLEncoder performs application/x-www-form-urlencoded encoding, as used in query strings. In a URL path segment like this one, the +-means-space rule does not apply; spaces in paths must be percent-encoded as %20.

Then again, in the specific case of Wikipedia, spaces are replaced with underscores for readability, so you'd probably be better off replacing "+" with "_" directly. The %20 version will still work because Wikipedia redirects from it to the underscore version.
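
For the general IRI-to-URI conversion described above, including the hostname step, the following is a minimal sketch using only java.net classes. It assumes the scheme, host and path are already available as separate, unencoded strings; the class name, the method name and the bücher.example host are hypothetical.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;

class IriToUri {
    // Converts the parts of an IRI into an ASCII-only URI string.
    static String toAsciiUri(String scheme, String unicodeHost, String path) {
        try {
            // Step 1: apply the IDN algorithm (Punycode) to the hostname.
            String asciiHost = IDN.toASCII(unicodeHost);
            // Step 2: the multi-argument constructor quotes illegal characters such as
            // spaces, and toASCIIString() percent-encodes the remaining non-ASCII
            // characters as UTF-8 bytes.
            return new URI(scheme, asciiHost, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toAsciiUri("http", "en.wikipedia.org", "/wiki/2009\u201310_UE_Lleida_season"));
        System.out.println(toAsciiUri("http", "b\u00FCcher.example", "/a b"));
    }
}
```

One caveat with this approach: the multi-argument URI constructors always quote "%", so a path that is already percent-encoded would be encoded a second time.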

Here's a simple algorithm for encoding URLs that use Unicode so that you can use HttpURLConnection to retrieve them:

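import static org.junit.Assert.*;

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

import org.apache.commons.lang.CharUtils;
import org.junit.Test;

public class InternationalURLEncoderTest {

    // Splits the URL around each "/" (keeping the slashes as separate tokens)
    // and percent-encodes only the segments that contain non-ASCII characters.
    static String encodeUrl(String urlToEncode) {
        String[] pathSegments = urlToEncode.split("((?<=/)|(?=/))");
        StringBuilder encodedUrlBuilder = new StringBuilder();
        for (String pathSegment : pathSegments) {
            boolean needsEncoding = false;
            for (char ch : pathSegment.toCharArray()) {
                if (!CharUtils.isAscii(ch)) {
                    needsEncoding = true;
                    break;
                }
            }
            // Always encode as UTF-8; the one-argument URLEncoder.encode is
            // deprecated and uses the platform default charset.
            // (The Charset overload requires Java 10 or later.)
            String encodedSegment = needsEncoding
                    ? URLEncoder.encode(pathSegment, StandardCharsets.UTF_8)
                    : pathSegment;
            encodedUrlBuilder.append(encodedSegment);
        }
        return encodedUrlBuilder.toString();
    }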

    @Test
    public void test() {
        assertEquals(
                "http://www.chinatimes.com/realtimenews/%E5%8D%97%E6%8A%95%E4%B8%80%E8%90%AC%E5%A4%9A%E6%88%B6%E5%A4%A7%E5%81%9C%E9%9B%BB-%E4%B9%9D%E6%88%90%E4%BB%A5%E4%B8%8A%E6%81%A2%E5%BE%A9%E4%BE%9B%E9%9B%BB-20130603003259-260401",
                encodeUrl("http://www.chinatimes.com/realtimenews/南投一萬多戶大停電-九成以上恢復供電-20130603003259-260401"));
        assertEquals("http://www.ttv.com.tw/",
                encodeUrl("http://www.ttv.com.tw/"));
        assertEquals("http://www.ttv.com.tw",
                encodeUrl("http://www.ttv.com.tw"));
        assertEquals("http://www.rt-drive.com.tw/shopping/?st=16",
                encodeUrl("http://www.rt-drive.com.tw/shopping/?st=16"));
    }

}

The algorithm was written using these answers on string splitting and detecting Unicode characters.

Here's a simpler way of encoding the URL in Chi's answer:

static String encodeUrl(String urlToEncode) throws URISyntaxException {
    return new URI(urlToEncode).toASCIIString();
}

See this answer for clarification.
