What is the most accurate encoding detector? [closed]

2023-01-16 19:39

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.

Closed 4 years ago.


After some research, I found that there are a few encoding detection projects in the Java world to turn to when getEncoding in InputStreamReader does not work:

juniversalchardet
jchardet
cpdetector
ICU4J

However, I really do not know which of them is the best. Can anyone with hands-on experience tell me which one works best in Java?

I've tested juniversalchardet and ICU4J on some CSV files, and the results were inconsistent. Overall, juniversalchardet had better results:

UTF-8: both detected it.
Windows-1255: juniversalchardet detected it once the text had enough Hebrew letters, while ICU4J still reported ISO-8859-1. With even more Hebrew letters, ICU4J reported ISO-8859-8, which is the other Hebrew encoding (and so the text was still decoded correctly).
SHIFT_JIS (Japanese): juniversalchardet detected it; ICU4J reported ISO-8859-2.
ISO-8859-1: detected by ICU4J; not supported by juniversalchardet.

So you should consider which encodings you will most likely have to deal with. In the end I chose ICU4J.

Note that ICU4J is still maintained.

Also note that you can try ICU4J first and, if it returns null because detection failed, fall back to juniversalchardet, or the other way around.

AutoDetectReader in Apache Tika does exactly this: it first tries HtmlEncodingDetector, then UniversalEncodingDetector (which is based on juniversalchardet), and then Icu4jEncodingDetector (based on ICU4J).
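The fallback idea described above can be sketched with nothing but the JDK. This is only an illustration of the chaining pattern, not Tika's or any library's actual code: DetectorChain, byBom and byAscii are hypothetical names, and the two stand-in detectors are deliberately trivial. In a real project each entry in the list would wrap a library detector such as ICU4J or juniversalchardet and return null when that library cannot decide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: tries several detectors in order, like a fallback chain.
class DetectorChain {
    // Each detector returns a charset name, or null when it cannot decide.
    private final List<Function<byte[], String>> detectors;

    DetectorChain(List<Function<byte[], String>> detectors) {
        this.detectors = detectors;
    }

    // Ask each detector in turn and return the first non-null answer.
    String detect(byte[] data, String defaultCharset) {
        for (Function<byte[], String> detector : detectors) {
            String name = detector.apply(data);
            if (name != null) {
                return name;
            }
        }
        return defaultCharset;
    }

    // Stand-in detector (assumption): recognises only a UTF-8 byte order mark.
    static String byBom(byte[] b) {
        boolean hasBom = b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
        return hasBom ? "UTF-8" : null;
    }

    // Stand-in detector (assumption): accepts data made only of 7-bit bytes.
    static String byAscii(byte[] b) {
        for (byte value : b) {
            if ((value & 0x80) != 0) {
                return null;
            }
        }
        return "US-ASCII";
    }

    // Builds a chain from the two stand-ins, in priority order.
    static DetectorChain standIns() {
        List<Function<byte[], String>> list = new ArrayList<>();
        list.add(DetectorChain::byBom);
        list.add(DetectorChain::byAscii);
        return new DetectorChain(list);
    }
}
```

Calling DetectorChain.standIns().detect(bytes, "ISO-8859-1") returns the first answer any detector gives, and the supplied default when all of them return null. The order of the list decides which detector wins when two of them would disagree.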

I found an answer online:

http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html

It says something valuable:

The strength of a character encoding detector lies in whether or not its focus is on statistical analysis or HTML META and XML prolog discovery. If you are processing HTML files that have META, use cpdetector. Otherwise, your best bet is either monq.stuff.EncodingDetector or com.sun.syndication.io.XmlReader.

So that's why I am using cpdetector now. I will update this post with the results.
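To make the distinction in the quoted passage concrete, here is a rough sketch of what "HTML META and XML prolog discovery" means, as opposed to statistical analysis: it simply looks for a declared charset near the start of the file. This is not how cpdetector is implemented; DeclaredCharsetSniffer, the regular expression and the 1024-byte window are all assumptions made for illustration, and the pattern is crude enough to be fooled by text that merely contains "charset=".

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads a declared charset instead of guessing from byte statistics.
class DeclaredCharsetSniffer {
    // Matches charset=... in an HTML META tag or encoding="..." in an XML prolog.
    private static final Pattern DECLARED = Pattern.compile(
            "(?i)(?:charset|encoding)\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)");

    // Returns the declared charset name, or null if none appears in the first 1024 bytes.
    static String sniff(byte[] data) {
        int length = Math.min(data.length, 1024);
        // ISO-8859-1 maps every byte to one char, so this decode cannot fail.
        String head = new String(data, 0, length, StandardCharsets.ISO_8859_1);
        Matcher matcher = DECLARED.matcher(head);
        return matcher.find() ? matcher.group(1) : null;
    }
}
```

A declaration can be missing or wrong, which is why a detector of this kind is usually combined with a statistical one as a fallback.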

I've personally used jchardet in our project (juniversalchardet wasn't available back then) just to check if a stream was UTF-8 or not.

It was easier to integrate with our application than the others and yielded great results.
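For the narrower question of whether a stream is UTF-8 or not, a strict decode with the JDK alone is often enough. The sketch below is not jchardet and does no statistical guessing; Utf8Check and isValidUtf8 are hypothetical names. It only reports whether the bytes are well-formed UTF-8, so pure ASCII input also passes, since ASCII is a subset of UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: validates bytes as UTF-8 without any third-party library.
class Utf8Check {
    // Returns true only if every byte sequence in data is well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Text in a single-byte encoding such as Windows-1255 that contains non-ASCII letters will almost always fail this check, because its high bytes rarely form valid UTF-8 sequences.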
