Reading lines of text in unknown encoding

I need to read a text file line by line and apply several CharsetDecoders to each line, in order. Specifically, I first try to decode the line as UTF-8, and fall back to a single-byte charset if the UTF-8 CharsetDecoder throws MalformedInputException.
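A minimal sketch of that decode-with-fallback step, assuming the raw bytes of one line are already available. The class name `LineDecoder`, the method name `decodeLine`, and the choice of fallback charset are hypothetical, not taken from the original code. The decoder is configured with `CodingErrorAction.REPORT` so that invalid input raises an exception instead of being replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 decoding with a single-byte fallback.
class LineDecoder {
    static String decodeLine(byte[] lineBytes, Charset fallback) {
        // REPORT makes the decoder throw on bad input rather than substitute.
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(lineBytes)).toString();
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass of CharacterCodingException.
            return new String(lineBytes, fallback);
        }
    }
}
```

A call might look like `LineDecoder.decodeLine(bytes, Charset.forName("KOI8-R"))`, where the fallback is whichever single-byte charset the file is expected to use.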

However, if I use InputStreamReader with the default or a specified charset, readLine silently replaces every byte it considers invalid for that charset with '?'.

I ended up writing my own line-reading function, which reads the stream byte by byte, looks for line terminators, and assembles the lines. But this approach is terribly slow.

Is there any way to make Java read lines without touching the bytes?

UPDATE: I've found out that there are charsets in which all 256 byte values are valid, two of them being the line terminators. So it is possible to read a raw byte stream line by line. Examples of such charsets are:

IBM00858 IBM437 IBM775 IBM850 IBM852 IBM855 IBM860 IBM861 IBM862 IBM863 IBM865 IBM866 ISO-8859-1 ISO-8859-13 ISO-8859-15 ISO-8859-2 ISO-8859-4 ISO-8859-5 ISO-8859-9 KOI8-R KOI8-U windows-1256
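The idea can be sketched as follows, using ISO-8859-1 because it maps each of the 256 byte values to the code point with the same number, so encoding the line back with the same charset returns the original bytes. The class name `RawLineReader` and the method name `readRawLines` are hypothetical. This relies on the bytes 0x0A and 0x0D always being line terminators in the file, which holds for UTF-8 and for ASCII-compatible single-byte charsets.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split a stream into lines while preserving raw bytes.
class RawLineReader {
    static List<byte[]> readRawLines(InputStream in) throws IOException {
        List<byte[]> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Each char is in U+0000..U+00FF, so this recovers the bytes as read.
                lines.add(line.getBytes(StandardCharsets.ISO_8859_1));
            }
        }
        return lines;
    }
}
```

Each returned byte array can then be passed to whatever sequence of CharsetDecoders is needed, while BufferedReader handles buffering and line splitting.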

The question is now closed.


You can't use a reader class and expect it not to decode the underlying byte stream. If you have a file where each line is encoded in a different charset (?), then you'd be better off devising a method for detecting the underlying character encoding. Perhaps you can use an encoding detector such as juniversalchardet.

