which mysql utf8 collation to use for data coming from latin1
I have a lot of data in mysql in latin1 format, and I'm trying to convert everything to utf8, but so far I didn't find the right collation to convert it correctly. Some of the data are international names, with a lot of accent, and characters from va开发者_运维问答rious language and a primary key on these fields. I have a simple test case that goes like this:
CREATE TABLE utf8_test ( value varchar(30), PRIMARY KEY(value) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_swedish_ci;
INSERT INTO utf8_test VALUES ('e');
INSERT INTO utf8_test VALUES ('é');
INSERT INTO utf8_test VALUES ('è');
INSERT INTO utf8_test VALUES ('ü');
INSERT INTO utf8_test VALUES ('u');
INSERT INTO utf8_test VALUES ('y');
INSERT INTO utf8_test VALUES ('ÿ');
It's a simple test to demonstrate the problem, but real data is not limited to this.
So far, only utf8_bin can accept everything without duplicate error, but I can't use it, because it's case-sensitive. Am I missing something ?
note: some of the tables have more than a dozen millions rows, so performance is a factor.
As far as I know, MySQL does not yet have case sensitive utf8 collations, so utf8_bin
is about your only choice. utf8_bin
is case sensitive, and treats all the diacritical marks as being separate. As Blair Conrad pointed out, it can make a bit of a mess of sorting. You may be able to solve that by using collate in your SQL statements, although if you have a lot of SQL statements, this approach could be problematic.
CREATE TABLE `utf8_test` (
`value` varchar(30) COLLATE utf8_bin NOT NULL DEFAULT '',
PRIMARY KEY (`value`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
INSERT INTO utf8_test VALUES ('e');
INSERT INTO utf8_test VALUES ('é');
INSERT INTO utf8_test VALUES ('è');
INSERT INTO utf8_test VALUES ('ü');
INSERT INTO utf8_test VALUES ('u');
INSERT INTO utf8_test VALUES ('y');
INSERT INTO utf8_test VALUES ('ÿ');
INSERT INTO utf8_test VALUES ('E');
INSERT INTO utf8_test VALUES ('É');
INSERT INTO utf8_test VALUES ('È');
INSERT INTO utf8_test VALUES ('Ü');
INSERT INTO utf8_test VALUES ('U');
INSERT INTO utf8_test VALUES ('Y');
INSERT INTO utf8_test VALUES ('Ÿ');
SELECT value FROM utf8_test WHERE value = 'E';
+-------+
| value |
+-------+
| E |
+-------+
SELECT value FROM utf8_test WHERE value COLLATE utf8_unicode_ci = 'E';
+-------+
| value |
+-------+
| E |
| e |
| È |
| É |
| è |
| é |
+-------+
SELECT value FROM utf8_test ORDER BY value;
+-------+
| value |
+-------+
| E |
| U |
| Y |
| e |
| u |
| y |
| È |
| É |
| Ü |
| è |
| é |
| ü |
| ÿ |
| Ÿ |
+-------+
SELECT value FROM utf8_test ORDER BY value COLLATE utf8_unicode_ci;
+-------+
| value |
+-------+
| E |
| é |
| è |
| É |
| È |
| e |
| u |
| Ü |
| U |
| ü |
| y |
| Y |
| ÿ |
| Ÿ |
+-------+
The correct collation depends on more than the source character set (Latin-1). It depends on the language and locale that the data came from. If you're just worried about making sure that no characters are folded one into another, utf8_bin is probably the way to go.
You may experience unexpected sorting, though, where the characters are sorted according to the UTF-8 byte values, rather than according to any particular language or regional preferences.
In order to get a case-insensitive collation, you may want to try utf8_general_ci or utf8_unicode_ci, as described in UTF-8: General? Bin? Unicode?
Edit Further research suggests that there's currently no collation that you can use. See http://bugs.mysql.com/bug.php?id=19567.
How's this for an option? Assign utf8_bin to your column(s) but apply a different collation (one that will fold case and accents) on your queries? You won't lose any data, and your queries will be case-insensitive. If having accent-insensitive queries don't hurt you, it could be a compromise...
精彩评论