
Converting a txt File from ANSI to UTF-8 programmatically

I need your help here, please. I'm working on a Java application that loads data from a txt file into a database. The problem is that the file has ANSI encoding, which I can't change because it comes from outside my application, and when I write the data to the database I get some "???" in it. My question is: how can I convert the data that I read from the file from ANSI to UTF-8, which can handle those weird symbols? I've tried the byte[] to String conversion, but it didn't work.


Open a decoding Reader like this one:

Reader reader = 
   new InputStreamReader(inputStream, Charset.forName(encodingName));

Exactly which encoding name you should use depends on which "ANSI" encoding the text file was written in. The Java 6 documentation lists the supported encodings. If the file comes from an English-language Windows system, it will likely be windows-1252.
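
To make the Reader approach concrete, here is a minimal sketch that decodes a file with a given source charset and rewrites it as UTF-8 using only the JDK. The class name AnsiToUtf8, the convert method, and the windows-1252 default are assumptions for illustration; substitute whichever encoding the file was really written in.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original answer.
class AnsiToUtf8 {

    /**
     * Decodes source with sourceCharset and writes the text to target as UTF-8.
     * source and target must be different files, because target is truncated on open.
     */
    static void convert(Path source, Path target, Charset sourceCharset) throws IOException {
        try (Reader reader = new InputStreamReader(Files.newInputStream(source), sourceCharset);
             Writer writer = new OutputStreamWriter(Files.newOutputStream(target), StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int read;
            // Copy decoded characters; the writer re-encodes them as UTF-8.
            while ((read = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, read);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: AnsiToUtf8 <source> <target> [sourceCharset]");
            return;
        }
        // Assumption: default to windows-1252 when no charset name is given.
        Charset charset = Charset.forName(args.length > 2 ? args[2] : "windows-1252");
        convert(Paths.get(args[0]), Paths.get(args[1]), charset);
    }
}
```

If the goal is only to load the data into the database, the intermediate UTF-8 file is optional: once the Reader has decoded the bytes with the right charset, the resulting Java strings can be passed to JDBC directly.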

Writing data to the database correctly depends on configuring the database correctly and (sometimes) providing the right configuration to the JDBC driver.
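
One plausible source of the "???" (an assumption, since the question does not say which database or driver is involved) is that the text passes through a charset that cannot represent some of its characters somewhere between the Java string and the stored column. The sketch below reproduces that effect in plain Java: String.getBytes replaces every unmappable character with the charset's replacement byte, which is '?' for ISO-8859-1. The class name QuestionMarkDemo and the roundTrip helper are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class QuestionMarkDemo {

    /** Encodes text into the given charset and decodes it back again. */
    static String roundTrip(String text, Charset charset) {
        // Characters the charset cannot represent are replaced during encoding.
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9 \u20ac"; // e with acute accent, then the euro sign

        // ISO-8859-1 has the accented e but no euro sign, so the euro sign becomes '?'.
        String lossy = roundTrip(text, StandardCharsets.ISO_8859_1);
        System.out.println(lossy.equals("caf\u00e9 ?"));

        // UTF-8 can represent every character, so the text survives unchanged.
        String intact = roundTrip(text, StandardCharsets.UTF_8);
        System.out.println(intact.equals(text));
    }
}
```

If the strings are already correct in Java but still arrive as "?", the connection or column character set is the place to look.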

You can read more about character encoding handling here and here.


1. What is ANSI?

https://www.cnblogs.com/malecrab/p/5300486.html

2. Required libraries

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
</dependency>
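<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
</dependency>
<!-- The sample also imports Guava (Sets) and Lombok (@Slf4j) -->
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
</dependency>
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
</dependency>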

3. Java sample

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Set;

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

import com.google.common.collect.Sets;
import lombok.extern.slf4j.Slf4j;

/**
 *
 * @author wang.qingsong
 * Created on 2021/09/16
 */
@Slf4j
public class FileUtil {

    public static boolean isFileEncodingUtf8(File inputFile) throws IOException {
        return isUtf8(getFileEncoding(inputFile));
    }

    public static String getFileEncoding(File file) throws IOException {
        try (FileInputStream fileInputStream = new FileInputStream(file);) {
            return getInputStreamEncoding(fileInputStream);
        }
    }

    public static String getInputStreamEncoding(InputStream input) throws IOException {
        CharsetDetector charsetDetector = new CharsetDetector();
        // The detector needs mark/reset support, so wrap streams that are not already buffered.
        // Only a wrapper created here is closed here.
        BufferedInputStream buffInput = null;
        try {
            charsetDetector.setText(
                input instanceof BufferedInputStream ? input : (buffInput = new BufferedInputStream(input)));
            charsetDetector.enableInputFilter(true);
            CharsetMatch cm = charsetDetector.detect();
            // detect() returns null when no charset matches the input.
            return cm == null ? null : cm.getName();
        } finally {
            IOUtils.closeQuietly(buffInput);
        }
    }

    public static void convertFileToUtf8(File inputFile, File outputFile) throws IOException {
        final String encoding = getFileEncoding(inputFile);
        if (StringUtils.isEmpty(encoding)) {
            throw new RuntimeException("Could not detect the encoding of inputFile.");
        }
        if (isUtf8(encoding)) {
            throw new RuntimeException("inputFile is already UTF-8, no conversion needed.");
        }

        if (!outputFile.exists()) {
            outputFile.createNewFile();
        }

        try (FileInputStream inputStream = new FileInputStream(inputFile);
             InputStreamReader inputReader = new InputStreamReader(inputStream, encoding);
             // output
             FileOutputStream outputStream = new FileOutputStream(outputFile);
             OutputStreamWriter outputWriter = new OutputStreamWriter(outputStream, StandardCharsets.UTF_8)) {
            IOUtils.copy(inputReader, outputWriter);
        }
    }

    private static boolean isUtf8(String encoding) {
        final Set<String> aliases = Sets.newHashSet("utf-8", "utf_8", "utf8");
        for (String utf8 : aliases) {
            if (StringUtils.equalsIgnoreCase(utf8, encoding)) {
                return true;
            }
        }
        return false;
    }
}
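
The sample above relies on Tika's statistical detection, which is a guess and can be wrong, particularly for short files. As a dependency-free alternative (a different technique, not what the sample does), a strict UTF-8 decode can tell whether the bytes are well-formed UTF-8, with a caller-chosen fallback charset otherwise. The class name Utf8Check and both methods are hypothetical, and the fallback is an assumption the caller has to supply.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original sample.
class Utf8Check {

    /** Returns true if the bytes are well-formed UTF-8 (pure ASCII also passes). */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Decodes as UTF-8 when the bytes are valid UTF-8, otherwise with the fallback charset. */
    static String decodeWithFallback(byte[] bytes, Charset fallback) {
        return new String(bytes, isValidUtf8(bytes) ? StandardCharsets.UTF_8 : fallback);
    }

    public static void main(String[] args) {
        // "cafe" with an accented e, encoded as windows-1252: 0xE9 alone is not valid UTF-8.
        byte[] ansi = {0x63, 0x61, 0x66, (byte) 0xE9};
        System.out.println(isValidUtf8(ansi));
        System.out.println(decodeWithFallback(ansi, Charset.forName("windows-1252")).length());
    }
}
```

This only separates "valid UTF-8" from "something else"; it cannot tell one ANSI code page from another, so the fallback still has to come from knowledge of where the file originated.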
