Java BufferedReader Arabic text file problem

Problem: Arabic words in my text files, when read by Java, show up as a series of question marks: ??????

Here is the code:

        File[] fileList = mainFolder.listFiles();
        BufferedReader bufferReader = null;
        Reader reader = null;

        try {
            for (File f : fileList) {
                // Decode the file contents as UTF-8
                reader = new InputStreamReader(new FileInputStream(f.getPath()), "UTF8");
                bufferReader = new BufferedReader(reader);
                String line = null;

                while ((line = bufferReader.readLine()) != null) {
                    System.out.println(new String(line.getBytes(), "UTF-8"));
                }
            }
        } catch (Exception exc) {
            exc.printStackTrace();
        } finally {
            // Close the BufferedReader
            try {
                if (bufferReader != null)
                    bufferReader.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }

As you can see, I have specified the UTF-8 encoding in several places and I still get question marks. Do you have any idea how I can fix this?

Thanks


Instead of trying to print out the line directly, print out the Unicode value of each character. For example:

char[] chars = line.toCharArray();
for (int i = 0; i < chars.length; i++)
{
    System.out.println(i + ": " + chars[i] + " - " + (int) chars[i]);
}

Then look up the relevant characters in the Unicode code charts.

If you find it's printing 63, then those really are question marks... which would suggest that your text file isn't truly UTF-8 to start with.

If, on the other hand, it prints "?" for some characters but with a value other than 63, that would suggest it's a console display issue and you're reading the data correctly.
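To make the lookup in the code charts a bit easier, here is a small variation on the loop above that prints each value in hexadecimal, which is how the charts are indexed. The class and method names are made up for illustration. Arabic letters fall in the U+0600 to U+06FF block, while a real question mark is U+003F (decimal 63).

```java
public class CodePointDump {

    // Builds one "index: U+XXXX" entry per char of the given line.
    // (Hypothetical helper, not part of the original code.)
    static String describe(String line) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            sb.append(i)
              .append(": ")
              .append(String.format("U+%04X", (int) line.charAt(i)))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Four Arabic letters written as escapes, so the source file
        // encoding cannot interfere with the demonstration.
        System.out.print(describe("\u0633\u0644\u0627\u0645"));
        // Genuine question marks show up as U+003F.
        System.out.print(describe("??"));
    }
}
```

If the values you see are in the U+06xx range, the file was decoded correctly and only the display is at fault.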


Replace

System.out.println(new String(line.getBytes(), "UTF-8"));

with

System.out.println(line);

String#getBytes() without a charset argument encodes the string using the platform default encoding, which is not necessarily UTF-8. If that default encoding cannot represent Arabic characters, each one is typically replaced with a question mark at that point. You have already decoded the bytes as UTF-8 in the InputStreamReader, so there is no need to convert the string back and forth afterwards.
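The effect of that round trip can be reproduced in isolation. The sketch below assumes, purely for illustration, that the platform default is ISO-8859-1 (your actual default may differ) and passes it explicitly instead of relying on the no-argument getBytes(). The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {

    // Mimics new String(line.getBytes(), "UTF-8") with the platform
    // default charset made explicit as a parameter.
    static String roundTrip(String text, Charset assumedDefault) {
        // Characters the charset cannot encode become its replacement byte.
        byte[] bytes = text.getBytes(assumedDefault);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String arabic = "\u0633\u0644\u0627\u0645";

        // With a default that has no Arabic letters, the text is lost.
        String broken = roundTrip(arabic, StandardCharsets.ISO_8859_1);
        System.out.println(broken.equals("????"));

        // With UTF-8 as the default, the round trip happens to be harmless.
        String intact = roundTrip(arabic, StandardCharsets.UTF_8);
        System.out.println(intact.equals(arabic));
    }
}
```

This is why the same code can appear to work on one machine and print question marks on another: the result depends on the default encoding of the machine it runs on.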

Further, ensure that the console where you're viewing those lines supports UTF-8. In Eclipse, for example, you can set this under Window > Preferences > General > Workspace > Text File Encoding > Other > UTF-8.
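On the Java side, System.out also encodes characters with an encoding chosen by the runtime, so it may help to wrap it in a PrintStream with an explicit charset. This is only a sketch under the assumption that the console itself renders UTF-8. The helper name is made up for illustration.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {

    // Wraps any output stream in a PrintStream that always encodes as UTF-8.
    // (Hypothetical helper, not part of the original code.)
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        // Arabic letters written as escapes to keep the source ASCII-only.
        out.println("\u0633\u0644\u0627\u0645");
    }
}
```

If the console is not set to UTF-8, this will still display incorrectly, just with different garbage, so both sides need to agree.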

See also:

  • Unicode - How to get the characters right?