Character corruption going from BufferedReader to BufferedWriter in Java

2023-01-13 12:16 Q&A author:

In Java, I am trying to parse an HTML file that contains complex text such as Greek symbols.

I run into a known problem when the text contains a left-facing quotation mark. Text such as

mutations to particular “hotspot” regions

becomes

 mutations to particular “hotspot�? regions

I have isolated the problem by writing a simple text copy method:

public static int CopyFile()
{
    try
    {
        StringBuffer sb = null;
        String NullSpace = System.getProperty("line.separator");
        Writer output = new BufferedWriter(new FileWriter(outputFile));
        String line;
        BufferedReader input = new BufferedReader(new FileReader(myFile));
        while ((line = input.readLine()) != null)
        {
            sb = new StringBuffer();
            //Parsing would happen
            sb.append(line);
            output.write(sb.toString()+NullSpace);
        }
        return 0;
    }
    catch (Exception e)
    {
        return 1;
    }
}

Can anybody offer some advice on how to correct this problem?

★My solution

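InputStream in = new FileInputStream(myFile);
Reader reader = new InputStreamReader(in, "utf-8");
Reader buffer = new BufferedReader(reader);
Writer output = new BufferedWriter(new FileWriter(outputFile));
int r;
// Read through the buffered wrapper rather than the raw reader
while ((r = buffer.read()) != -1)
{
    if (r < 126)
    {
        // Low values are plain ASCII and are written unchanged
        output.write(r);
    }
    else
    {
        // Everything from 126 upward is written as a numeric HTML character reference
        output.write("&#" + Integer.toString(r) + ";");
    }
}
output.flush();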

The file read is not in the same encoding (probably UTF-8) as the file written (probably ISO-8859-1).
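To see why a mismatch like this damages the quotation marks, here is a small sketch. It assumes the input bytes are UTF-8 and that the wrong charset is ISO-8859-1, as guessed above; the actual default on the asker's machine is not known, and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1,
    // imitating a reader that assumes the wrong charset (an assumed scenario).
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encodes and decodes with the same charset, which preserves the text.
    static String roundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+201C and U+201D are the curly quotes from the question.
        String original = "\u201Chotspot\u201D";
        String wrong = misread(original);

        // Each curly quote takes three bytes in UTF-8, and ISO-8859-1 turns
        // every byte into its own character, so the misread text is longer.
        System.out.println("original length: " + original.length());
        System.out.println("misread length:  " + wrong.length());
        System.out.println("round trip intact: " + roundTrip(original).equals(original));
    }
}
```

Once the quote has been split into several unrelated characters, writing the string back out with any charset cannot restore it, which is why the charset has to be right at the point where the bytes are first decoded.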

Try the following to generate a file with UTF-8 encoding:

BufferedWriter output = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"));

Unfortunately, determining the encoding of a file is very difficult. See Java : How to determine the correct charset encoding of a stream
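Although the encoding cannot be detected reliably, a candidate can at least be ruled out. The sketch below is one possible approach, not a detector: it decodes the bytes strictly and reports whether any byte sequence is invalid for the candidate charset. The class name `CharsetCheck` and the method `decodesCleanly` are hypothetical names chosen for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class CharsetCheck {
    // Returns true if the bytes decode under the given charset without any
    // malformed or unmappable input. A false result rules the charset out;
    // a true result does not prove the charset is the right one.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Expects the path of the file to check as the first argument.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(decodesCleanly(bytes, StandardCharsets.UTF_8)
                ? "decodes cleanly as UTF-8"
                : "not valid UTF-8");
    }
}
```

This check is only useful for strict encodings such as UTF-8. ISO-8859-1 assigns a character to every possible byte, so it accepts any input and the check tells you nothing about it.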

In addition to what Thierry-Dimitri Roy wrote, if you know the encoding you have to create your FileReader with a bit of extra work. From the docs:

Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.

The Javadoc for FileReader says:

The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.

In your case the default character encoding is probably not appropriate. Find what encoding the input file uses, and specify it. For example:

FileInputStream fis = new FileInputStream(myFile);
InputStreamReader isr = new InputStreamReader(fis, "charset name goes here");
BufferedReader input = new BufferedReader(isr);
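Putting the advice from this thread together, the copy method from the question could look roughly like the sketch below, with the charset stated on both the reading and the writing side. The class name `CopyWithCharset` and the parameter names are assumptions for illustration, and choosing UTF-8 in the usage example is a guess about the asker's files.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CopyWithCharset {
    // Copies a text file line by line, decoding with inCharset and encoding
    // with outCharset instead of relying on the platform default.
    static void copy(File in, Charset inCharset, File out, Charset outCharset)
            throws IOException {
        // try-with-resources closes both streams, which also flushes the writer.
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(new FileInputStream(in), inCharset));
             BufferedWriter writer = new BufferedWriter(
                     new OutputStreamWriter(new FileOutputStream(out), outCharset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Parsing would happen here.
                writer.write(line);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Expects the input path and the output path as arguments.
        copy(new File(args[0]), StandardCharsets.UTF_8,
             new File(args[1]), StandardCharsets.UTF_8);
    }
}
```

Passing `Charset` constants rather than charset name strings avoids the checked `UnsupportedEncodingException` and catches misspelled names at compile time.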

