Encoding issues crawling non-English websites

I'm trying to get the contents of a webpage as a string. I found this question addressing how to write a basic web crawler, which claims to (and seems to) handle the encoding issue; however, the code provided there, which works for US/English websites, fails to properly handle other languages.

Here is a full Java class that demonstrates what I'm referring to:

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class I18NScraper
{
    static
    {
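        // Clear the http.agent system property; the real User-Agent header is set explicitly per connection below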
        System.setProperty("http.agent", "");
    }

    public static final String IE8_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)";

    //https://stackoverflow.com/questions/1381617/simplest-way-to-correctly-load-html-from-web-page-into-a-string-in-java
    private static final Pattern CHARSET_PATTERN = Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");
    public static String getPageContentsFromURL(String page) throws UnsupportedEncodingException, MalformedURLException, IOException {
        Reader r = null;
        try {
            URL url = new URL(page);
            HttpURLConnection con = (HttpURLConnection)url.openConnection();
            con.setRequestProperty("User-Agent", IE8_USER_AGENT);

            Matcher m = CHARSET_PATTERN.matcher(con.getContentType());
            /* If Content-Type doesn't match this pre-conception, choose default and 
             * hope for the best. */
            String charset = m.matches() ? m.group(1) : "ISO-8859-1";
            r = new InputStreamReader(con.getInputStream(), charset);
            StringBuilder buf = new StringBuilder();
            while (true) {
                int ch = r.read();
                if (ch < 0)
                    break;
                buf.append((char) ch);
            }
            return buf.toString();
        } finally {
            if(r != null){
                r.close();
            }
        }
    }

    private static final Pattern TITLE_PATTERN = Pattern.compile("<title>([^<]*)</title>");
    public static String getDesc(String page){
        Matcher m = TITLE_PATTERN.matcher(page);
        if(m.find())
            return m.group(1);
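        // Fallback for debugging: returns "true" or "false" to show whether a <title> tag was present at all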
        return page.contains("<title>")+"";
    }

    public static void main(String[] args) throws UnsupportedEncodingException, MalformedURLException, IOException{
        System.out.println(getDesc(getPageContentsFromURL("http://yandex.ru/yandsearch?text=%D0%A0%D0%B5%D0%B7%D1%83%D0%BB%D1%8C%D1%82%D0%B0%D1%82%D0%BE%D0%B2&lr=223")));
    }
}

Which outputs:

???????????&nbsp;&mdash; ??????: ??????? 360&nbsp;???&nbsp;???????

Though it ought to be:

Результатов&nbsp;&mdash; Яндекс: Нашлось 360&nbsp;млн&nbsp;ответов

Can you help me understand what I'm doing wrong? Trying things like forcing UTF-8 does not help, despite that being the charset listed in the source and in the HTTP header.


Determining the right charset encoding can be tricky.

You need to use a combination of

a) the HTML META Content-Type tag:

<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">

b) the HTTP response header:

Content-Type: text/html; charset=utf-8

c) Heuristics to detect charset from bytes (see this question)

The reason for using all three is:

  1. (a) and (b) might be missing
  2. The META Content-Type might be wrong (see this question)

What to do if (a) and (b) are both missing?

In that case you need to use some heuristics to determine the correct encoding - see this question.
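For the heuristic step, a byte-level detector library does the job. Here is a minimal sketch using ICU4J's CharsetDetector (my choice for illustration; juniversalchardet is another common option, and the class name CharsetGuesser is mine):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class CharsetGuesser
{
    /** Guesses the charset of raw page bytes, e.g. "UTF-8" or "windows-1251". */
    public static String guessCharset(byte[] pageBytes) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(pageBytes);
        CharsetMatch match = detector.detect(); // best match by confidence, may be null
        return match == null ? null : match.getName();
    }
}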

I find this sequence to be the most reliable for robustly identifying the charset encoding of an HTML page:

  1. Use the HTTP response header Content-Type (if present)
  2. Use an encoding detector on the response content bytes
  3. Use the HTML META Content-Type tag

but you might choose to swap 2 and 3.
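A sketch of that sequence in code (the regex and the ISO-8859-1 fallback are illustrative choices, and CharsetGuesser is the hypothetical detector wrapper from the sketch above):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetResolver
{
    // Matches charset=... in an HTTP Content-Type value or an HTML META tag.
    private static final Pattern CHARSET =
        Pattern.compile("charset=[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    /** contentType is the HTTP Content-Type header value (may be null);
     *  body is the raw response bytes; html is the page decoded leniently
     *  (e.g. as ISO-8859-1) just so the META tag can be found. */
    public static String resolve(String contentType, byte[] body, String html) {
        String fromHeader = extract(contentType);             // 1. HTTP header
        if (fromHeader != null) return fromHeader;
        String detected = CharsetGuesser.guessCharset(body);  // 2. byte heuristics
        if (detected != null) return detected;
        String fromMeta = extract(html);                      // 3. HTML META tag
        return fromMeta != null ? fromMeta : "ISO-8859-1";    // last-resort default
    }

    private static String extract(String s) {
        if (s == null) return null;
        Matcher m = CHARSET.matcher(s);
        return m.find() ? m.group(1) : null;
    }
}

Note the chicken-and-egg problem in step 3: you need some decoding before you can search for the META tag, so a common trick is to decode the first few kilobytes leniently (e.g. as ISO-8859-1) just to locate it, then re-decode the whole page with the charset it names.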


The problem you are seeing is that the default encoding on your Mac doesn't support Cyrillic script. I'm not sure whether it's still true on an Oracle JVM, but when Apple was producing its own JVMs, the default character encoding for Java was MacRoman.

When you start your program, specify the file.encoding system property to set the character encoding to UTF-8 (which is what Mac OS X uses by default). Note that you have to set it at launch: java -Dfile.encoding=UTF-8 ...; if you set it programmatically (with a call to System.setProperty()), it's too late, and the setting will be ignored.

Whenever Java needs to encode characters to bytes—for example, when it's converting text to bytes to write to the standard output or error streams—it will use the default unless you explicitly specify a different one. If the default encoding can't encode a particular character, a suitable replacement character is substituted.

If the encoding can represent the Unicode replacement character U+FFFD (�), that is what gets substituted. Otherwise, a question mark (?) is the usual stand-in, which matches the output you're seeing.
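If you can't change the launch command, a workaround is to bypass the default encoding for console output explicitly. A minimal sketch (assuming the terminal itself is set to UTF-8, otherwise this just moves the problem):

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console
{
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Wrap stdout with an explicit charset instead of relying on file.encoding.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("Результатов"); // Cyrillic survives the char-to-byte step
    }
}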


Apache Tika contains an implementation of exactly this kind of charset detection, and many people use it for that purpose. You could also look into Apache Nutch, a full crawler framework, in which case you wouldn't have to implement your own crawler at all.
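For instance, with the Tika facade (a sketch; it assumes the tika-core and tika-parsers jars are on the classpath), fetching a page and extracting its text, charset detection included, is a single call:

import java.net.URL;

import org.apache.tika.Tika;

public class TikaFetch
{
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Tika detects the content type and charset itself, then extracts plain text.
        String text = tika.parseToString(new URL("http://yandex.ru/"));
        System.out.println(text);
    }
}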
