How to determine text encoding

I know that UTF files can carry a BOM that identifies the encoding, but what about other encodings that give no such clue? How can I guess those?

I am a new Java programmer. I have written code that detects the UTF encodings from the BOM, but I have a problem with other encodings. How do I guess them?

Can anybody help me? Thanks in advance.


This question is a duplicate of several previous ones. There are at least two libraries for Java that attempt to guess the encoding (although keep in mind that there is no way to guess right 100% of the time).

  • GuessEncoding
  • jchardet (Java port of the algorithm used by Mozilla Firefox)

Of course, if you know the encoding will only be one of three or four options, you might be able to write a more accurate guessing algorithm.
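If the candidates are known up front, one way to sketch this is to try each one with a strict decoder and keep the first that decodes without errors. This is only an illustration, not what GuessEncoding or jchardet do internally; the class name `CandidateCharsetGuesser` and the method `guess` are made up for this example. Order matters: single-byte charsets such as ISO-8859-1 accept any byte sequence in Java, so they have to go last.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: tries a short, ordered list of candidate charsets.
class CandidateCharsetGuesser {

    /**
     * Returns the first candidate that decodes the data without a
     * malformed-input or unmappable-character error, or null if none does.
     */
    static Charset guess(byte[] data, Charset... candidates) {
        for (Charset candidate : candidates) {
            try {
                candidate.newDecoder()
                        // REPORT makes the decoder throw instead of
                        // silently substituting U+FFFD.
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data));
                return candidate;
            } catch (CharacterCodingException e) {
                // Not valid in this charset; fall through to the next one.
            }
        }
        return null;
    }
}
```

For example, `CandidateCharsetGuesser.guess(bytes, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)` would prefer UTF-8 and fall back to ISO-8859-1 only when the bytes are not valid UTF-8. A successful decode only means the bytes are legal in that charset, not that the charset is the right one.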


Short answer is: you cannot.

Even in UTF-8, the BOM is entirely optional, and it's often recommended not to use it, since many apps do not handle it properly and just display it as if it were a printable character. The original purpose of the byte order mark was to indicate the endianness of UTF-16 files.

That said, most apps that handle Unicode implement some sort of guessing algorithm: they read the beginning of the file and look for certain signatures.
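The BOM is the simplest such signature. A minimal sketch of checking the first bytes against the standard Unicode BOM values could look like the following; the class name `BomSniffer` and the choice to return a charset name as a `String` (or `null` when no BOM is present) are assumptions made for this example.

```java
// Hypothetical helper: maps a leading byte order mark to a charset name.
class BomSniffer {

    /** Returns the charset name implied by a leading BOM, or null if there is none. */
    static String detect(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        // The UTF-32 checks must come before UTF-16, because the
        // UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            // Mask to compare as unsigned, since Java bytes are signed.
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }
}
```

A `null` result says nothing about the encoding, only that no BOM was found. Note also that `FF FE 00 00` is technically ambiguous: it could be a UTF-16LE file whose first character is U+0000, although that is rare in text files.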


If you don't know the encoding and don't have any indicators (like a BOM), it's not always possible to accurately "guess" the encoding. Some pointers exist that can give you hints.

For example, an ISO-8859-1 file will (usually) not contain any 0x00 bytes, whereas a UTF-16 file containing mostly Latin text will have loads of them.
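That hint can be turned into a rough heuristic by counting 0x00 bytes at even and odd offsets in a sample of the file. The sketch below is only an illustration: the class name `NulByteHeuristic`, the returned labels, and the 30% threshold are all arbitrary choices for this example, and it will misjudge UTF-16 text that is mostly non-Latin (for example CJK), where few bytes are zero.

```java
// Hypothetical helper: guesses UTF-16 byte order from where the NUL bytes fall.
class NulByteHeuristic {

    // Assumed cut-off: fraction of 2-byte units that must contain a NUL.
    private static final double THRESHOLD = 0.3;

    static String guess(byte[] sample) {
        int evenZeros = 0;
        int oddZeros = 0;
        for (int i = 0; i < sample.length; i++) {
            if (sample[i] == 0) {
                if (i % 2 == 0) evenZeros++; else oddZeros++;
            }
        }
        if (evenZeros == 0 && oddZeros == 0) {
            // No NULs at all: consistent with ISO-8859-1, UTF-8, and
            // other byte-oriented encodings, but not proof of any of them.
            return "no-NUL";
        }
        int units = sample.length / 2;
        // Latin text in UTF-16LE stores the zero high byte second (odd offsets);
        // in UTF-16BE it comes first (even offsets).
        if (oddZeros > evenZeros && oddZeros >= THRESHOLD * units) return "UTF-16LE";
        if (evenZeros > oddZeros && evenZeros >= THRESHOLD * units) return "UTF-16BE";
        return "unknown";
    }
}
```

A result of `"no-NUL"` only rules UTF-16 and UTF-32 mostly out; telling ISO-8859-1 from other single-byte encodings still needs a different approach or input from the user.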

The most common solution is to let the user select the encoding if you cannot detect it.
