'a' char appended when reading unicode text from txt file in android

2023-03-05 18:07

Hello, I am trying to read a UTF-8 encoded txt file with Hebrew chars in my Android application. I have managed to read it, but for some reason the 'a' char is always appended at the beginning of the String I read, and I wonder why.

Here is my code:

    void Read() {
        try {
            File fileDir = new File("/sdcard/test.txt");

            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream(fileDir), "UTF8"));

            String str;

            while ((str = in.readLine()) != null) {
                Log.i("TEST", str);
            }

            in.close();
        } catch (UnsupportedEncodingException e) {
            System.out.println(e.getMessage());
        } catch (IOException e) {
            System.out.println(e.getMessage());
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }

This is the result I get:

05-15 01:53:25.269: INFO/TEST(16236): אבגדהוזחטיכלמנסעפצקרשתa

In order to get a better answer, I need two questions answered:

What is the exact code point of the character in question (your "a")?
What is the exact byte sequence in your file, around the questionable area?
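One way to answer both questions is to dump the code points of the first line and the first few raw bytes of the file. This is only a sketch: the class name `TextDiagnostics` and its two helpers are made up for illustration and are not part of your code or of any Android API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here only for illustration.
final class TextDiagnostics {

    /** Returns the code points of s as "U+XXXX" tokens separated by spaces. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            // Advance by 1 or 2 chars depending on whether cp is a surrogate pair.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    /** Returns at most limit leading bytes as two-digit hex values separated by spaces. */
    static String hex(byte[] bytes, int limit) {
        StringBuilder sb = new StringBuilder();
        int n = Math.min(limit, bytes.length);
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes are not sign-extended.
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }
}
```

Inside your loop you could then log `TextDiagnostics.codePoints(str)` instead of `str`, which avoids any confusion caused by how the log viewer renders right-to-left text.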

I'm going to take a guess here. You say the character is the first thing in the file ("appended at the beginning of the String") and that it turned out to be in the Arabic Presentation Forms-B block. The last code point of Arabic Presentation Forms-B, which oddly has nothing to do with Arabic, is U+FEFF, the byte order mark (BOM). It usually appears at the beginning of UTF-16 or UTF-32 encoded files and identifies the "endianness" of the encoding (whether the file is UTF-16LE or UTF-16BE, and likewise for UTF-32). It typically does not appear in UTF-8 data, since UTF-8 has no notion of byte order. That said, some brain-dead Windows programs will stick it there, and then offer an additional option of "UTF-8 without BOM". (The BOM then serves to identify the file as likely being encoded in UTF-8.) My guess is that you have a BOM in your data, and your program is reading it and passing it on to you.

If this is your problem, and your file is genuinely encoded in UTF-8, you should find the following byte sequence at the very beginning of the file: EF BB BF. This is the UTF-8 encoding of U+FEFF.
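If the BOM does turn out to be the culprit, one possible workaround is to drop a leading U+FEFF from the first line after decoding. The following is a minimal sketch under that assumption; `BomSkipper`, `stripBom` and `readLines` are names invented for this example, and it takes an `InputStream` rather than your file path so that it stays self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here only for illustration.
final class BomSkipper {
    private static final char BOM = '\uFEFF';

    /** Returns s without a leading U+FEFF, or s unchanged if there is none. */
    static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    /** Reads all lines as UTF-8 and removes a BOM from the first line only. */
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                // A BOM is only meaningful at the very start of the stream.
                lines.add(first ? stripBom(line) : line);
                first = false;
            }
        }
        return lines;
    }
}
```

In your code this would amount to passing `new FileInputStream(fileDir)` to `readLines` and logging each returned line. Re-saving the file as "UTF-8 without BOM" would avoid the problem at the source instead.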
