In Java, what is the best way to ensure that I'm getting UTF-8 strings?
When collecting query parameters from a beaconing system in a servlet, what is the best method in Java to ensure that I'm properly converting all input coming in from third-party sites into valid UTF-8 strings I can store in my logfiles?
Java strings are internally always UTF-16. Where you really need to pay attention to encodings is when you convert bytes to Strings and vice versa, because that's what an encoding is: a set of rules to convert between bytes and characters/Strings. It is not a property of Strings. In your case, conversion should happen exactly twice: when you read from the third-party sites, and when you write to your logfile.
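To make the bytes-versus-characters point concrete, here is a small sketch (the class and method names are made up for illustration). It encodes the same one-character String with two charsets and then decodes UTF-8 bytes with the wrong charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Render bytes as space-separated uppercase hex, e.g. "C3 A9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode the text with the given charset and show the resulting bytes.
    static String encodedHex(String text, Charset charset) {
        return hex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // e with acute accent, a single char in the String

        // Same String, different bytes depending on the encoding used.
        System.out.println(encodedHex(text, StandardCharsets.UTF_8));      // C3 A9
        System.out.println(encodedHex(text, StandardCharsets.ISO_8859_1)); // E9

        // Decoding UTF-8 bytes as ISO-8859-1 yields two wrong chars instead of one.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length()); // 2
    }
}
```

The String itself never "has" an encoding here; only the byte arrays do.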
When reading from the third-party sites, you cannot just assume UTF-8, since those sites can use all kinds of different encodings. Thus you need to adhere to the encoding they declare in the HTTP header, HTML META tag, or XML header. Any decent HTTP client will do that for you, so you just need to let it do its job and not try to do anything fancy yourself.
When writing to your logfile, on the other hand, you should make sure you are using UTF-8 and not the platform default encoding (even if that is UTF-8, it could change). This should be done in your logging library's configuration, or if you write the files without such a library, when you create an OutputStreamWriter.
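If you do write the file yourself, a minimal sketch could look like the following. The class name, the writeLine helper, and the temp-file name are assumptions for illustration; the only important part is passing StandardCharsets.UTF_8 to the OutputStreamWriter instead of relying on the default:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8LogWriter {
    // Encode one log line as UTF-8 and write it to the sink.
    // The caller owns the sink and is responsible for closing it.
    static void writeLine(OutputStream sink, String line) {
        Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8);
        try {
            writer.write(line);
            writer.write('\n');
            writer.flush(); // push the encoded bytes through to the sink
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical log location; a real application would use its own path.
        Path log = Files.createTempFile("beacon", ".log");
        try (OutputStream out = new FileOutputStream(log.toFile(), true)) {
            writeLine(out, "q=caf\u00e9");
        }
        System.out.println("wrote " + log);
    }
}
```

For a long-running servlet you would normally keep one writer open rather than create one per line; the per-call writer here just keeps the sketch short.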
Step 1: make sure that the page containing the form is itself in UTF-8.
Step 2: check the headers of the incoming request to see if they give you a character set.
Step 3: don't depend on String(byte[]) or InputStreamReader(InputStream). Always call functions that take an explicit character set specification.
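Steps 2 and 3 could be sketched roughly as below. This is an illustration only: RequestCharset, fromContentType, and decode are hypothetical names, the header parsing is deliberately simplistic, and falling back to UTF-8 when no usable charset is declared is an assumption, not a rule. In a servlet the header value would typically come from the request's getContentType().

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class RequestCharset {
    // Pull the charset parameter out of a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Returns the fallback if the header is
    // missing, has no charset parameter, or names an unknown charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Decode raw bytes with an explicit charset, never the platform default.
    static String decode(byte[] body, String contentType) {
        Charset charset = fromContentType(contentType, StandardCharsets.UTF_8);
        return new String(body, charset);
    }

    public static void main(String[] args) {
        byte[] latin1Body = {(byte) 0xE9}; // e with acute accent in ISO-8859-1
        String text = decode(latin1Body, "text/plain; charset=ISO-8859-1");
        System.out.println(text.equals("\u00e9")); // true
    }
}
```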
The String(byte[] bytes, Charset charset) constructor allows you to specify the character set used to decode the bytes.
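One caveat: that constructor silently replaces malformed input with the replacement character U+FFFD, so it never tells you that the bytes were not valid UTF-8. If you would rather detect bad input, a CharsetDecoder configured to report errors is one option. The sketch below uses hypothetical names (StrictUtf8, decodeOrNull), and returning null for invalid input is just one possible policy:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Decode bytes as UTF-8, returning null if they are not valid UTF-8.
    static String decodeOrNull(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] valid = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 for U+00E9
        byte[] invalid = {(byte) 0xE9};            // ISO-8859-1 byte, not valid UTF-8

        System.out.println(decodeOrNull(valid) != null);   // true
        System.out.println(decodeOrNull(invalid) == null); // true

        // The lenient constructor hides the problem behind U+FFFD.
        String lenient = new String(invalid, StandardCharsets.UTF_8);
        System.out.println(lenient.equals("\uFFFD"));      // true
    }
}
```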