How to get non-Latin characters from a website?

2023-02-14 05:21 Q&A author:

I am trying to fetch data from latata.pl/pl.php and display all the characters (Polish, ISO-8859-2).

    final URL url = new URL("http://latata.pl/pl.php");
    final URLConnection urlConnection = url.openConnection();
    final BufferedReader in = new BufferedReader(new InputStreamReader(
            urlConnection.getInputStream()));
    String inputLine;

    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
    }
    in.close();

It doesn't work. :( Any ideas?

InputStreamReader has several constructors, and in a case like this you should use one that lets you specify the encoding.

Your InputStreamReader will be attempting to convert the bytes coming back over the TCP connection using your platform default encoding (which is most likely UTF-8 or one of the horrible Windows ones). You should explicitly specify an encoding.
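To illustrate the point about specifying the encoding, here is a minimal sketch. It uses an in-memory byte array in place of the network stream so it runs offline; the class name ExplicitCharsetDemo and the helper readFirstLine are made up for this example, and the three bytes are the first three of the page body discussed in this thread.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    // Hypothetical helper: reads the first line of the stream,
    // decoding the bytes with the charset passed in rather than
    // the platform default.
    static String readFirstLine(InputStream in, Charset charset) throws IOException {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for urlConnection.getInputStream(): the bytes
        // 0xB1 0xEA 0xB3, which are "ąęł" in ISO-8859-2.
        byte[] body = {(byte) 0xB1, (byte) 0xEA, (byte) 0xB3};

        String asLatin2 = readFirstLine(
                new ByteArrayInputStream(body), Charset.forName("ISO-8859-2"));
        String asLatin1 = readFirstLine(
                new ByteArrayInputStream(body), StandardCharsets.ISO_8859_1);

        // Same bytes, different text depending on the charset used to decode.
        System.out.println("ISO-8859-2: " + asLatin2);
        System.out.println("ISO-8859-1: " + asLatin1);
    }
}
```

With the real connection you would pass urlConnection.getInputStream() instead of the ByteArrayInputStream. What the console actually shows also depends on the console's own encoding.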

Assuming the web server is configured properly, you can find the correct encoding in the Content-Type HTTP header. Alternatively, you can just assume it's ISO-8859-2, but that might break later.
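A rough sketch of picking the charset out of a Content-Type header value, with a fallback for when none is declared. The class name ContentTypeCharset and the method charsetFrom are invented for this example, and the parsing is deliberately simple rather than a full implementation of the header grammar. In the question's code the header value would come from urlConnection.getContentType().

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    // Hypothetical helper: returns the charset named in a Content-Type
    // value such as "text/html; charset=ISO-8859-2", or the fallback if
    // the header is missing, has no charset parameter, or names a
    // charset this JVM does not know.
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length())
                                   .trim()
                                   .replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(charsetFrom("text/html; charset=ISO-8859-2", fallback));
        System.out.println(charsetFrom("text/html", fallback));
    }
}
```

The resulting Charset can then be handed to the InputStreamReader constructor. Which fallback to use is a judgment call; for this particular page, ISO-8859-2 would be the one that gives readable output.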

This is too long for a comment, but who set up that web page? You? From what I can see, it doesn't look correct.

Here's what you get back:

$ telnet latata.pl 80
Trying 91.205.74.65...
Connected to latata.pl.
Escape character is '^]'.
GET /pl.php HTTP/1.0
Host: latata.pl

HTTP/1.1 200 OK
Date: Sun, 27 Feb 2011 13:49:19 GMT
Server: Apache/2
X-Powered-By: PHP/5.2.16
Vary: Accept-Encoding,User-Agent
Content-Length: 10
Connection: close
Content-Type: text/html

����ʣ��Connection closed by foreign host.

The HTML is simply:

<html>
<head></head>
<body>±ê³ó¿¡Ê£¯¬</body>
</html>

And that's how your page appears from a browser. Is there a valid reason why no charset is specified in that HTML page?

The output of your PHP script pl.php is faulty. The HTTP header Content-Type: text/html is set without a declared charset. Without a declared charset, the client has to assume ISO-8859-1 according to the HTTP specification. The body that is sent reads ±ê³ó¿¡Ê£¯¬ if interpreted as ISO-8859-1.

The bytes sent by the PHP script would represent ąęłóżĄĘŁŻŹ if the response were declared as

Content-Type: text/html; charset=ISO-8859-2

You can check this with a simple code fragment, which turns the misdecoded string back into ISO-8859-1 bytes and decodes those bytes as ISO-8859-2:

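// The body as misread under ISO-8859-1
final String test = "±ê³ó¿¡Ê£¯¬";
// Recover the raw bytes, then decode them as ISO-8859-2
// (both calls can throw UnsupportedEncodingException)
final String repaired = new String(test.getBytes("ISO-8859-1"), "ISO-8859-2");
System.out.println(repaired);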

The output will be ąęłóżĄĘŁŻŹ, which are Polish characters.

As a quick fix, have your PHP script send Content-Type: text/html; charset=ISO-8859-2 as an HTTP header.

But you should consider switching to UTF-8 encoded output anyway.
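For a sense of what a switch to UTF-8 means at the byte level, here is a small sketch that decodes ISO-8859-2 bytes and re-encodes the text as UTF-8. The class name TranscodeToUtf8 and the method toUtf8 are invented for this example; the thread's actual fix belongs in the PHP script, and this only shows the conversion itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeToUtf8 {

    // Hypothetical helper: decodes the input using the source charset,
    // then encodes the resulting text as UTF-8.
    static byte[] toUtf8(byte[] input, Charset source) {
        return new String(input, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "ąęł" as ISO-8859-2: one byte per character.
        byte[] latin2 = {(byte) 0xB1, (byte) 0xEA, (byte) 0xB3};
        byte[] utf8 = toUtf8(latin2, Charset.forName("ISO-8859-2"));

        // In UTF-8 each of these three characters takes two bytes.
        System.out.println(latin2.length + " bytes in ISO-8859-2, "
                + utf8.length + " bytes in UTF-8");
    }
}
```

The byte counts differ, so a page served as UTF-8 would also need its Content-Length and charset declaration to match the new encoding.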

As someone has already stated, no charset is specified for the response. Forcing the response document to be viewed as ISO-8859-2 (typically used in Central Europe) results in legitimate Polish characters being displayed, so I assume this is the encoding actually being used. Since no encoding has been specified, ISO-8859-1 will be assumed, as this is the default.

The response headers need to include Content-Type: text/html; charset=ISO-8859-2 for the bytes to be interpreted correctly. A client can then use this charset when decoding the response InputStream.
