How do web servers know the charset used in forms posted to them?

2023-03-13 23:20

When a web server receives a POSTed form, parsing it into parameter-value pairs is quite straightforward. However, if the values contain non-ASCII characters that the browser has encoded, the server must know which charset was used in order to decode them.

I've examined the requests sent by two posts. One came from a page using UTF-8, and the other from a page using Windows-1255. The same text was encoded differently. AFAIK, the Content-Type header could carry a charset parameter after application/x-www-form-urlencoded, but it didn't (using Firefox).
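To illustrate why the two posts differed, here is a minimal Java sketch that percent-encodes the same character under two charsets, roughly the way a browser does for application/x-www-form-urlencoded. The class name and the sample character are illustrative assumptions. ISO-8859-1 stands in for the legacy code page because every JVM is guaranteed to ship it, whereas Windows-1255 may be missing on a trimmed-down runtime.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of any servlet API.
class FormEncodingDemo {

    // Percent-encodes a form value using the charset of the submitting page.
    static String encode(String value, Charset pageCharset) {
        return URLEncoder.encode(value, pageCharset); // this overload needs Java 10+
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // e with acute accent, escaped so the source stays ASCII
        System.out.println(encode(text, StandardCharsets.UTF_8));      // %C3%A9
        System.out.println(encode(text, StandardCharsets.ISO_8859_1)); // %E9
    }
}
```

The bytes on the wire differ, and nothing in the encoded value itself says which charset produced them.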

In a servlet, when you use request.getParameter(), you're supposed to get the decoded value. How does the servlet container do that? Does it always bet on UTF-8, use some heuristics, or is there some deterministic way I'm missing?

From the Servlet 3.0 Spec, section 3.10 Request Data Encoding (emphasis mine):

Currently, many browsers do not send a char encoding qualifier with the ContentType header, leaving open the determination of the character encoding for reading HTTP requests. The default encoding of a request the container uses to create the request reader and parse POST data must be “ISO-8859-1” if none has been specified by the client request. However, in order to indicate to the developer, in this case, the failure of the client to send a character encoding, the container returns null from the getCharacterEncoding method.

If the client hasn’t set character encoding and the request data is encoded with a different encoding than the default as described above, breakage can occur. To remedy this situation, a new method setCharacterEncoding(String enc) has been added to the ServletRequest interface. Developers can override the character encoding supplied by the container by calling this method. It must be called prior to parsing any post data or reading any input from the request. Calling this method once data has been read will not affect the encoding.
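The rule quoted above can be modelled with the standard library alone. This is only a sketch of the decision order, not how any particular container implements it: the class and method names are hypothetical, and the `override` argument plays the role of a value passed to setCharacterEncoding.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical model of the quoted rule; real containers differ in detail.
class RequestCharsetSketch {

    // Returns the charset named in a Content-Type header value, or null if none is given.
    // Charset.forName throws if the name is illegal or unsupported on this JVM.
    static Charset charsetFromContentType(String contentType) {
        if (contentType == null) return null;
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                return Charset.forName(name);
            }
        }
        return null;
    }

    // Developer override first, then the client's header, then the ISO-8859-1 default.
    static Charset effectiveCharset(String contentType, Charset override) {
        if (override != null) return override;
        Charset fromHeader = charsetFromContentType(contentType);
        return fromHeader != null ? fromHeader : StandardCharsets.ISO_8859_1;
    }

    // Decodes one percent-encoded form value (this overload needs Java 10+).
    static String decodeParam(String rawValue, String contentType, Charset override) {
        return URLDecoder.decode(rawValue, effectiveCharset(contentType, override));
    }

    public static void main(String[] args) {
        String raw = "%C3%A9"; // e-acute as sent from a UTF-8 page
        String type = "application/x-www-form-urlencoded"; // no charset parameter
        // Default path: the two bytes become two characters, U+00C3 and U+00A9.
        System.out.println(decodeParam(raw, type, null).length());
        // With the override, the bytes decode to the single character U+00E9.
        System.out.println(decodeParam(raw, type, StandardCharsets.UTF_8).length());
    }
}
```

With no charset in the header and no override, the UTF-8 bytes are read as ISO-8859-1 and the value comes out garbled, which is the breakage the spec describes.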

In practice, I find that the charset set on a response influences the charset the browser uses for the subsequent POST from that page. To be extra sure, you can write a Servlet Filter that calls setCharacterEncoding on every request object before it is used.

You may also find this thread useful: "Detecting the character encoding of an HTTP POST request".

The appropriate header for specifying charsets is Accept-Charset.

The latest Chrome for Linux, for example, sends

Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3

on each request.

Section 14.2 from http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html states:

The Accept-Charset request-header field can be used to indicate what character sets are acceptable for the response. This field allows clients capable of understanding more comprehensive or special-purpose character sets to signal that capability to a server which is capable of representing documents in those character sets.

(...)

If no Accept-Charset header is present, the default is that any character set is acceptable. If an Accept-Charset header is present, and if the server cannot send a response which is acceptable according to the Accept-Charset header, then the server SHOULD send an error response with the 406 (not acceptable) status code, though the sending of an unacceptable response is also allowed.

So if you receive such a header from a client, the value with the highest q may be the encoding the client is sending, although the header strictly describes what it will accept in the response.
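Picking the highest-q value could be sketched as follows. The class and method names are hypothetical, and the result is only a guess about the request body: as the quoted RFC text says, Accept-Charset describes what the client accepts in the response.

```java
import java.util.Locale;

// Hypothetical helper; treats Accept-Charset as a hint, not as proof of the body encoding.
class AcceptCharsetSketch {

    // Returns the charset token with the highest q value; the first one wins ties.
    // Returns null if the header value contains no tokens.
    static String preferredCharset(String headerValue) {
        String best = null;
        double bestQ = -1.0;
        for (String item : headerValue.split(",")) {
            String[] parts = item.trim().split(";");
            String name = parts[0].trim();
            if (name.isEmpty()) continue;
            double q = 1.0; // a missing q parameter means q=1
            for (int i = 1; i < parts.length; i++) {
                String p = parts[i].trim().toLowerCase(Locale.ROOT);
                if (p.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(p.substring(2).trim());
                    } catch (NumberFormatException e) {
                        q = 0.0; // treat a malformed weight as not acceptable
                    }
                }
            }
            if (q > bestQ) {
                bestQ = q;
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The Chrome header quoted above: ISO-8859-1 has the implicit weight 1.
        System.out.println(preferredCharset("ISO-8859-1,utf-8;q=0.7,*;q=0.3")); // ISO-8859-1
    }
}
```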
