Read UTF-16 chars from a file and store them as UTF-8

I have a Person POJO with a name attribute, which I store in the persons table of my database. My DB server is MySQL with UTF-8 set as the default server encoding, the persons table is an InnoDB table that was also created with UTF-8 as the default encoding, and my DB connection string specifies UTF-8 as the connection encoding.

I need to create and store new Person POJOs by reading their names from a text file (persons.txt) that contains one name per line. The file itself is encoded as UTF-16.

persons.txt

John

Μαρία

Hélène

etc.

Here is some sample code:

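import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

PersonDao dao = new PersonDao();
File file = new File("persons.txt");
// Decode the file as UTF-16; try-with-resources closes the reader when done.
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_16))) {
    String line;
    while ((line = reader.readLine()) != null) {
        Person p = new Person();
        p.setName(line.trim());
        dao.save(p);
    }
}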

To sum up, I am reading the characters as UTF-16, storing them in local variables, and persisting them as UTF-8.

I would like to ask: does any character conversion take place during this procedure? If so, at what point does it happen? Could I end up storing broken characters because of the UTF-16 -> UTF-8 workflow?


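InputStreamReader decodes bytes from their external representation in the specified encoding (UTF-16 in your case) into Java's internal representation (char, String), which is always UTF-16. The reader still has to interpret the byte order mark and the byte order, but the characters are not mapped to a different character set at this step, so effectively there is no conversion here in your case.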
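To illustrate the reading step, here is a minimal sketch that feeds UTF-16 bytes through an InputStreamReader and checks that the resulting Strings match the original text. The class name Utf16ReadSketch, the readNames helper, and the in-memory byte array standing in for persons.txt are all assumptions made for the example; the names are written with Unicode escapes only so the source compiles regardless of the compiler's source encoding.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Utf16ReadSketch {

    // Hypothetical helper: decodes UTF-16 bytes into trimmed lines.
    static List<String> readNames(byte[] utf16Bytes) {
        List<String> names = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf16Bytes), StandardCharsets.UTF_16))) {
            String line;
            while ((line = reader.readLine()) != null) {
                names.add(line.trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // "John", "Μαρία", "Hélène" written with escapes.
        String content = "John\n\u039C\u03B1\u03C1\u03AF\u03B1\nH\u00E9l\u00E8ne\n";
        // Stand-in for persons.txt: encoding with UTF_16 writes a byte order mark first.
        byte[] fileBytes = content.getBytes(StandardCharsets.UTF_16);

        List<String> names = readNames(fileBytes);
        // The decoder consumes the byte order mark, so it does not appear in the first name.
        System.out.println(names.get(0).equals("John"));
        System.out.println(names.get(1).equals("\u039C\u03B1\u03C1\u03AF\u03B1"));
    }
}
```

The Strings that come out of the reader hold the same characters that were written to the file; only the byte order mark is dropped.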

The internal String representation is converted to the database encoding by your JDBC driver, so you normally don't need to handle that yourself (although with MySQL you do need to specify the proper encoding in the connection string).
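As a rough sketch of what such a connection string can look like, the snippet below builds a MySQL Connector/J URL with the useUnicode and characterEncoding properties. The host, port, and database name are placeholders, and MySqlUrlSketch and buildUrl are hypothetical names for the example; it only assembles the URL and does not open a connection.

```java
public class MySqlUrlSketch {

    // Hypothetical helper: builds a Connector/J URL that asks the driver to use UTF-8.
    static String buildUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?useUnicode=true&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        // Placeholder host, port, and database name.
        System.out.println(buildUrl("localhost", 3306, "mydb"));
    }
}
```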

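If the input encoding and (in the case of MySQL) the database encoding are specified correctly, the conversions themselves cannot lose data, since UTF-8 and UTF-16 both encode the same character set, Unicode. One MySQL-specific caveat: its legacy utf8 charset (utf8mb3) stores at most three bytes per character, so characters outside the Basic Multilingual Plane need utf8mb4.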


UTF-8 and UTF-16 cover the same range of characters (full Unicode), so if the input data is valid, the output data will be valid too (unless there is a bug in dao.save()).
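The "if the input data is valid" condition can be checked explicitly. The sketch below decodes UTF-16 strictly (reporting malformed input instead of silently substituting a replacement character) and re-encodes the result as UTF-8. TranscodeSketch and utf16ToUtf8 are names made up for this example, and this is only one possible way to do the check, not something the code in the question requires.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class TranscodeSketch {

    // Hypothetical helper: strict UTF-16 decode followed by UTF-8 encode.
    // Throws IllegalArgumentException if the input is not valid UTF-16.
    static byte[] utf16ToUtf8(byte[] utf16Bytes) {
        try {
            String text = StandardCharsets.UTF_16.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(utf16Bytes))
                    .toString();
            return text.getBytes(StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Input is not valid UTF-16", e);
        }
    }

    public static void main(String[] args) {
        String name = "H\u00E9l\u00E8ne"; // "Hélène"
        byte[] utf16 = name.getBytes(StandardCharsets.UTF_16);
        byte[] utf8 = utf16ToUtf8(utf16);

        // Decoding the UTF-8 bytes gives back the same characters.
        String roundTripped = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(roundTripped.equals(name));
        System.out.println(utf16.length + " UTF-16 bytes, " + utf8.length + " UTF-8 bytes");
    }
}
```

The byte counts differ between the two encodings, but the characters that come back are identical, which is the point both answers make.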
