
How to detect the character encoding of a file?

Our application receives files from our users, and we must validate that each file uses one of the encodings we support (i.e. UTF-8, Shift-JIS, EUC-JP). Once a file is validated, we also need to save it in our system along with its encoding as metadata.

Currently, we're using JCharDet (which is a Java port of Mozilla's character detector), but there are some Shift-JIS characters that it seems to fail to detect as valid Shift-JIS characters.

Any ideas what else we can use?


ICU4J's CharsetDetector will help you.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import com.ibm.icu.text.CharsetDetector;

// setText(InputStream) needs mark/reset support, hence the BufferedInputStream
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(path));
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
String charsetName = cd.detect().getName();

By the way, what kind of character caused the failure, and what kind of error did you get? Depending on the character and the error, ICU4J may have the same problem.
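
Since the question only needs to accept UTF-8, Shift-JIS, and EUC-JP, one option is to look at all of the detector's candidates rather than just the top match. A minimal sketch, assuming ICU4J's usual charset names (the class name, the supported-name list, and the helper method are illustrative, not part of ICU4J):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class EncodingValidator {
    // Names as ICU4J typically reports them; adjust if your version differs.
    private static final List<String> SUPPORTED =
            Arrays.asList("UTF-8", "Shift_JIS", "EUC-JP");

    /** Returns the best-matching supported charset name, or null if none matches. */
    public static String detectSupportedCharset(String path) throws IOException {
        CharsetDetector cd = new CharsetDetector();
        try (BufferedInputStream bis =
                     new BufferedInputStream(new FileInputStream(path))) {
            cd.setText(bis);
            // detectAll() returns candidates ordered by confidence.
            for (CharsetMatch match : cd.detectAll()) {
                if (SUPPORTED.contains(match.getName())) {
                    return match.getName();
                }
            }
        }
        return null;
    }
}

A null result would then mean the upload is rejected; a non-null result can be stored directly as the file's encoding metadata.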


Apache Tika is a content analysis toolkit that is mainly useful for determining file types, as opposed to encoding schemes, but it does return content-encoding information for text file types. I don't know if its algorithms are as advanced as JCharDet's, but it might be worth a try...
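
As a rough sketch of how that might look with Tika's parser API (the class names follow common Tika usage, but the "Content-Encoding" metadata key for text files is an assumption and may vary between Tika versions):

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaEncodingSketch {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            Metadata metadata = new Metadata();
            // Parse the file; for plain-text input Tika records the detected
            // charset in the metadata (here read via the "Content-Encoding" key).
            new AutoDetectParser().parse(in, new BodyContentHandler(), metadata);
            System.out.println(metadata.get("Content-Encoding"));
        }
    }
}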
