开发者

How to correct double-encoded UTF-8 strings sitting in MySQL utf8_general_ci fields?

I have to redesign a class where (amongst other things) UTF-8 strings are double-encoded wrongly:

$string = iconv('ISO-8859-1', 'UTF-8', $string);
:
$string = utf8_encode($string);

These faulty strings have been saved into multiple table fields all over a MySQL database. All fields being affected use collation utf8_general_ci.

Usually I'd setup a little PHP patch script, looping thru the affected tables, SELECTing the records, correct the faulty records by using utf8_decode() on the double-encoded fields and UPDATE them.

As I got many and huge tables this time, and the error only affects german umlauts (äöüßÄÖÜ), I'm wondering if there's a solution smarter/faster than that.

Are pure MySQL solutions like the following safe and recommendable?

 UP开发者_运维百科DATE `table` SET `col` = REPLACE(`col`, 'ä', 'ä');

Any other solutions/best practices?


Alter the table to change the column character set to Latin-1. You will now have singly-encoded UTF-8 strings, but sitting in a field whose collation is supposed to be Latin-1.

What you do then is, change the column character set back to UTF-8 via the binary character set - that way MySQL doesn't convert the characters at any point.

ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET latin1
ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET binary
ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET utf8

(is the correct syntax iirc; put the appropriate column type in where ... is)


I tried the posted solutions, but my DB kept spitting up errors. Eventually I stumbled upon the following solution (in a forum I believe, but I can't remember where):

UPDATE table_name SET col_name = CONVERT(CONVERT(CONVERT(col_name USING latin1) USING binary) USING utf8);

and it worked a treat. Hope this helps anyone who stumbled here from desperate google searching like me.

NOTE: This is of course assuming your double encoded character issues originate from an overly helpful MySQL conversion from latin1 to utf8, but I believe that's where most of these "corrupted characters" happen. This basically does the same conversion as mentioned above back to latin1, then binary, then to utf8 (using the binary step as a way to prevent the re-encoding of the already encoded latin1 entities)


I found the following approach simpler:

mysqldump -h DB_HOST -u DB_USER -p --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql

Then drop all tables and re-import with following command:

mysql -h DB_HOST -u DB_USER -p --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql

Tip was found at this URL: http://blog.hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/


MySql is charset aware, so you can convert in SQL. But for this case, I would probably prefer to just script it in PHP, as it's a one-off task anyway.

Keep in mind that columns in MySql have a charset property. The collation is (in theory) orthogonal to the charset. While a utf8_general_ci collation would imply that the charset is utf8, it's not a given. You could in theory mix a utf8 collation with a latin1 encoding (And get garbage as a result).

If you decide to do this in SQL, look here:

http://dev.mysql.com/doc/refman/5.0/en/charset-convert.html


MySQL provides a regexp match but no regexp replace, so you're usually better off iterating through each row in php, converting as needed, and updating the row if it has been changed.


Generate a dump using mysqldump, change the encoding declaration (it's in the first commands), and reload in another database.

You can also use iconv on your dump to transcode it.

You can SELECT INTO OUTFILE, massage the file using php or iconv, then LOAD DATA INFILE.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜