
How do I get the encoding of a page before I download it?

I need to get the encoding of a web page (UTF-8, ISO-8859-1, etc.) before I download it, because I will use that encoding to convert the downloaded InputStream to a String.

I am using HttpURLConnection, and there is a method called getContentEncoding, but it only returns a value if the server sends one.

In some cases the encoding is in the charset attribute (HTML4?), in others it is in the encoding attribute (XHTML), and in others I don't know, but I presume there are other forms as well.

Is there a class that does this, or what is the way to do it?


The HTTP 1.1 specification indicates that Content-Type "should" be used to indicate the media type of the content, and that responses that do not include this header should be treated as "application/octet-stream" -- in other words, a sequence of bytes rather than characters. The use of "should" indicates that it's recommended practice, but some servers may not follow it.

So, your first step is to look for this header. If it's not present, don't apply any character-set decoding to the content. In the case of XML, assuming that you pass the stream on to a parser this will just work: either the stream will be UTF-8 encoded, or the prologue will specify the encoding. And you should always pass streams directly to an XML parser, without attempting to convert them to a string first.

If there is a Content-Type header, and it specifies a character set, then you're free to decode according to that character set. The spec also talks about how to deal with a missing character set: for any text content type, you should assume that it is encoded using ISO-8859-1.

So that's the next step: if there's a character set, or if it's a text media type, apply the decoding.

Otherwise, leave the stream alone.
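
The decision steps above can be sketched roughly as follows. This is only a minimal illustration, not a full header parser: the class name ContentTypeCharset and the method charsetFor are made up for the example, and it assumes the header value has already been read as a String. It returns null when the body should be left as raw bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, named for this example only.
public class ContentTypeCharset {

    /**
     * Returns the charset to decode with, or null when the body should be
     * left as raw bytes (no header, or a non-text type with no charset).
     */
    public static Charset charsetFor(String contentType) {
        if (contentType == null) {
            return null; // no Content-Type: treat as application/octet-stream
        }
        String[] parts = contentType.split(";");
        String mediaType = parts[0].trim().toLowerCase(Locale.ROOT);
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String name = param.substring(eq + 1).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    break; // unknown or malformed name: fall back to the default below
                }
            }
        }
        // Text types without a usable charset default to ISO-8859-1.
        return mediaType.startsWith("text/") ? StandardCharsets.ISO_8859_1 : null;
    }
}
```

With HttpURLConnection you would pass in conn.getContentType(). Note that getContentEncoding() reports the Content-Encoding header (for example gzip), not the character set, so it is not the right method for this purpose.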


Perhaps you could try issuing a HEAD request to fetch the HTTP headers before attempting to fully process the page? HttpURLConnection has setRequestMethod, where you could specify HEAD.

With a HEAD request, the server is supposed to return all headers but without the message body. You can try parsing the Content-Type header value. Example headers returned from the server would be:

HTTP/1.1 200 OK
Date: Mon, 23 May 2005 22:38:34 GMT
Server: Apache/1.3.3.7 (Unix)  (Red-Hat/Linux)
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Etag: "3f80f-1b6-3e1cb03b"
Accept-Ranges: bytes
Content-Length: 438
Connection: close
Content-Type: text/html; charset=UTF-8

The following snippet should give you an idea of how to iterate and read the headers returned in a HEAD request.

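// Assumes java.net.URL and java.net.HttpURLConnection are imported;
// the URL below is only a placeholder.
HttpURLConnection conn = (HttpURLConnection) new URL("http://example.com/").openConnection();
conn.setRequestMethod("HEAD"); // ask for the headers only, no body

// Print every header. Some implementations use index 0 for the status line
// (with a null key), so the loop starts at 1.
int i = 1;
String hKey;
while ((hKey = conn.getHeaderFieldKey(i)) != null) {
    String hVal = conn.getHeaderField(i);
    System.out.println(hKey + "=" + hVal);
    i++;
}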


There is no guarantee that you can do this without inspecting the document.

The HTML 4.01 spec details how to specify the encoding via the Content-Type HTTP header and/or the meta elements within the document.

In the case of XHTML served with Content-Type: application/xhtml+xml the encoding must be discovered from the document.
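
As a rough illustration of inspecting the document itself, the sketch below scans the first bytes of a page for an XML declaration or a meta element. The class name DeclaredEncodingSniffer and the method sniff are invented for this example, and the regular expressions are a simplification rather than a spec-compliant parser. It assumes an ASCII-compatible encoding, so it will not work for UTF-16 documents, and it does not look at byte order marks.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for this example only.
public class DeclaredEncodingSniffer {

    // XML declaration: <?xml version="1.0" encoding="..."?>
    private static final Pattern XML_DECL = Pattern.compile(
            "<\\?xml[^>]*?encoding\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Covers both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the declared encoding name, or null if none is found in the prefix. */
    public static String sniff(byte[] prefix) {
        // ISO-8859-1 maps every byte to exactly one char, so this decoding never
        // fails and the ASCII markup we search for comes through unchanged.
        String head = new String(prefix, StandardCharsets.ISO_8859_1);
        Matcher m = XML_DECL.matcher(head);
        if (m.find()) {
            return m.group(1);
        }
        m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

In practice you would buffer the first kilobyte or so of the stream, run something like sniff on it, and then decode the whole stream (including the buffered bytes) with the charset it reports, falling back to the HTTP header or a default when it returns null.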
