Java parsing UTF8
I have the following issue with a UTF-8 file structured as follows:
FIELD1§FIELD2§FIELD3§FIELD4
Looking at the hexadecimal values of the file, it uses A7 to encode §. So according to this encoding it should be UTF-8, but that's strange because A7 > 7F, so 1 byte shouldn't be enough to encode §.
So I tried using a BufferedReader directly with a specified charset:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(input), utf8));
but when I try to tokenize the string with
SmartTokenizer st = new SmartTokenizer(toTokenize, "§");
(the SmartTokenizer is a modified version of the StringTokenizer that keeps empty tokens)
no splitting occurs, and if I try to print the string I obtain
FIELD1?FIELD2?FIELD3?...
so the § used in the file is different from the one specified as the delimiter, and it can't be printed correctly either.
So what's the problem here? Maybe the original file should use 2 bytes to store §?
The UTF-8 encoding of § is 0xC2 0xA7.
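As a quick check of those byte values, the following minimal sketch prints how Java encodes § (U+00A7) in each charset. The class name SectionSignBytes and the helper hex are made up for illustration; only the standard String.getBytes and StandardCharsets calls are used.

```java
import java.nio.charset.StandardCharsets;

public class SectionSignBytes {
    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Written as a Unicode escape so the result does not depend on
        // the encoding of this source file.
        String sign = "\u00A7";
        System.out.println("UTF-8:      " + hex(sign.getBytes(StandardCharsets.UTF_8)));      // C2 A7
        System.out.println("ISO-8859-1: " + hex(sign.getBytes(StandardCharsets.ISO_8859_1))); // A7
    }
}
```

A lone A7 byte is therefore what a single-byte ISO-8859-* style encoding would produce, not UTF-8.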
If the file uses A7 to represent §, then it's probably written in ISO-8859-1 (or another ISO-8859-* encoding or one of their derivatives).
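A rough sketch of what that means on the reading side, assuming the file really is ISO-8859-1: the same bytes decode to a replacement character (U+FFFD, which many consoles show as ?) under UTF-8, but to § under ISO-8859-1. The class Latin1Fields and the method splitFields are hypothetical stand-ins, and String.split with a limit of -1 is used here in place of the asker's SmartTokenizer, which isn't shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;

public class Latin1Fields {
    // Splits on the section sign; the limit -1 keeps empty fields,
    // including trailing ones.
    static String[] splitFields(String line) {
        return line.split(Pattern.quote("\u00A7"), -1);
    }

    public static void main(String[] args) {
        // The bytes of "A§§C" as an ISO-8859-1 file would store them.
        byte[] raw = {0x41, (byte) 0xA7, (byte) 0xA7, 0x43};

        // A lone 0xA7 is not valid UTF-8, so it decodes to U+FFFD.
        String asUtf8 = new String(raw, StandardCharsets.UTF_8);
        System.out.println(asUtf8.indexOf('\uFFFD') >= 0); // true

        // Decoded as ISO-8859-1, 0xA7 becomes the section sign.
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.toString(splitFields(asLatin1))); // [A, , C]
    }
}
```

For the actual file, the equivalent change would be to pass StandardCharsets.ISO_8859_1 (or the charset name "ISO-8859-1") to the InputStreamReader instead of UTF-8. Writing the delimiter as "\u00A7" in the source also avoids a mismatch if the .java file itself is compiled with a different encoding.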
Looking at hexadecimal values of the file it uses A7 to codify §. So according to this codify it should be UTF8
Uh, why? It's ISO-8859-1 (also known as Latin-1) or a related encoding: http://en.wikipedia.org/wiki/ISO/IEC_8859-1