'a' char appended when reading Unicode text from a txt file in Android
Hello, I am trying to read a UTF-8 encoded txt file containing Hebrew characters in my Android application. I have managed to read it, but for some reason the 'a' char is always appended at the beginning of the String I read, and I wonder why.
Here is my code:
void Read() {
    try {
        File fileDir = new File("/sdcard/test.txt");
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(fileDir), "UTF8"));
        String str;
        while ((str = in.readLine()) != null) {
            Log.i("TEST", str);
        }
        in.close();
    }
    catch (UnsupportedEncodingException e)
    {
        System.out.println(e.getMessage());
    }
    catch (IOException e)
    {
        System.out.println(e.getMessage());
    }
    catch (Exception e)
    {
        System.out.println(e.getMessage());
    }
}
This is the result I get:
05-15 01:53:25.269: INFO/TEST(16236): אבגדהוזחטיכלמנסעפצקרשתa
In order to get a better answer, I need two questions answered:
- What is the exact code point of the character in question (your "a")?
- What is the exact byte sequence in your file, around the questionable area?
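Both questions can be answered with a few lines of code. The following is only a sketch: the class name `BomDiagnostics` and its helper methods are made up for illustration, and it assumes you can pass in the first line you read and the raw bytes of the file.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BomDiagnostics {
    // Formats the code point of the first character of s as "U+XXXX".
    static String firstCodePoint(String s) {
        if (s.isEmpty()) {
            return "(empty)";
        }
        return String.format("U+%04X", s.codePointAt(0));
    }

    // Formats up to `count` leading bytes of data as space-separated hex.
    static String hexPrefix(byte[] data, int count) {
        StringBuilder sb = new StringBuilder();
        int n = Math.min(count, data.length);
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    // Reads up to `count` raw bytes from the start of the file at `path`.
    static byte[] readPrefix(String path, int count) throws IOException {
        byte[] buf = new byte[count];
        int total = 0;
        InputStream in = new FileInputStream(path);
        try {
            int read;
            while (total < count && (read = in.read(buf, total, count - total)) != -1) {
                total += read;
            }
        } finally {
            in.close();
        }
        byte[] result = new byte[total];
        System.arraycopy(buf, 0, result, 0, total);
        return result;
    }
}
```

In the question's code you could log `BomDiagnostics.firstCodePoint(str)` for the first line, and `BomDiagnostics.hexPrefix(BomDiagnostics.readPrefix("/sdcard/test.txt", 8), 8)` for the raw bytes.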
I'm going to take a guess here: you say the character is the first thing in the file ("appended at the beginning of the String") and that it turned out to be in the Arabic Presentation Forms-B block. The last character of Arabic Presentation Forms-B, which oddly has nothing to do with Arabic, is U+FEFF, the byte order mark (BOM). It usually appears at the beginning of UTF-16 or UTF-32 encoded files and identifies the "endianness" of the encoding (whether the file is UTF-16LE or UTF-16BE encoded, likewise for UTF-32). It typically does not appear in UTF-8 data, as UTF-8 has no notion of "byte order". That said, some brain-dead Windows programs will stick it there, and then offer an additional option of "UTF-8 without BOM". (The BOM is then used to identify a file as likely being encoded in UTF-8.) My guess is that you have a BOM in your data, and your program is reading it and passing it on to you.
If this is your problem, and your file is genuinely encoded in UTF-8, you should be able to find the following byte sequence at the very beginning of the file: EF BB BF
This is the UTF-8 representation of U+FEFF.
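If the file does turn out to start with a BOM, one way to deal with it is to drop a leading U+FEFF from the first line after decoding. As far as I know, Java's UTF-8 decoder passes the BOM through as an ordinary character rather than removing it. The following is only a sketch: `BomUtil`, `stripBom` and `readLines` are hypothetical names, not part of any Android or Java API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class BomUtil {
    // Returns s without a leading U+FEFF (BOM); other strings are returned unchanged.
    static String stripBom(String s) {
        if (s != null && !s.isEmpty() && s.charAt(0) == '\uFEFF') {
            return s.substring(1);
        }
        return s;
    }

    // Reads all lines as UTF-8, removing a BOM from the first line if present.
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        try {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                if (first) {
                    line = stripBom(line); // a BOM can only occur at the very start
                    first = false;
                }
                lines.add(line);
            }
        } finally {
            reader.close();
        }
        return lines;
    }
}
```

In the question's code this would amount to calling `stripBom` on the first `str` returned by `readLine()` before logging it. Alternatively, re-save the file as "UTF-8 without BOM" so the extra character never reaches your program.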