Incorrect string encodings
Note: I have read all of the PHP, UTF-8 and character encoding articles that are usually suggested, but my question relates to data inserted before I applied those techniques. I want to retroactively fix all of the character encoding problems.
Now all connections are set as utf8 using PDO.
PDO::MYSQL_ATTR_INIT_COMMAND => 'SET NAMES utf8'
Unfortunately, a large amount of data of questionable encoding was inserted before I had implemented correct character encoding practices, as this script shows:
$sql = "SELECT name FROM data LIMIT 3";
foreach ($pdo->query($sql) as $row)
{
    $name = $row['name'];
    echo $name . "\n";
    echo utf8_encode($name) . "\n";
    echo utf8_decode($name) . "\n";
    echo htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . "\n";
    echo htmlspecialchars(utf8_encode($name), ENT_QUOTES, 'UTF-8') . "\n";
    echo htmlspecialchars(utf8_decode($name), ENT_QUOTES, 'UTF-8') . "\n";
    echo '<hr/>';
}
Which produces:
AntonÃÂn Dvořák
AntonÃÆÃÂn DvoÃâ¦Ãâ¢ÃÆák
Anton�?n Dvo�?�?�?¡k
AntonÃÂn Dvořák
AntonÃÆÃÂn DvoÃâ¦Ãâ¢ÃÆák
----------
Ô±Ö€Õ¡Õ´ Ô½Õ¡Õ¹Õ¡Õ¿Ö€ÕµÕ¡Õ¶
ñÃâ¬Ã¡Ã´ ýáùáÿÃâ¬ÃµÃ¡Ã¶
Ա�?ամ Խաչատ�?յան
Ô±Ö€Õ¡Õ´ Ô½Õ¡Õ¹Õ¡Õ¿Ö€ÕµÕ¡Õ¶
ñÃâ¬Ã¡Ã´ ýáùáÿÃâ¬ÃµÃ¡Ã¶
----------
Tiësto
Tiësto
Tiësto
Tiësto
Tiësto
Tiësto
----------
After removing 'SET NAMES utf8' from the PDO options, the output does actually contain the correct strings, albeit on different lines:
AntonÃn DvoÅák
AntonÃÂn DvoÃÂák
Antonín Dvořák
AntonÃn DvoÅák
AntonÃÂn DvoÃÂák
Antonín Dvořák
----------
Արամ Խաչատրյան
Ô±ÖÕ¡Õ´ Ô½Õ¡Õ¹Õ¡Õ¿ÖÕµÕ¡Õ¶
???? ?????????
Արամ Խաչատրյան
Ô±ÖÕ¡Õ´ Ô½Õ¡Õ¹Õ¡Õ¿ÖÕµÕ¡Õ¶
???? ?????????
----------
Tiësto
Tiësto
Ti�sto
Tiësto
Tiësto
----------
And here is a dump of the database rows concerned:
DROP TABLE IF EXISTS `data`;
CREATE TABLE IF NOT EXISTS `data` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(80) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `name` (`name`(10))
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=0;
INSERT INTO `data` (`id`, `name`) VALUES (0, 'AntonÃÂn Dvořák'), (1, 'Ô±Ö€Õ¡Õ´ Ô½Õ¡Õ¹Õ¡Õ¿Ö€ÕµÕ¡Õ¶'), (2, 'Tiësto');
The 3rd and 6th lines of the 3rd row ("Tiësto") are then echoed correctly. I'm just unsure of the best way to detect the encodings of the bad strings and correct them.
One way that should work (I haven't tried this myself) is to dump the database into a file using phpMyAdmin and then import it again, specifying latin1 as the encoding even though the file is UTF-8 encoded. (You need a phpMyAdmin version that offers a drop-down menu for the character set of the dump file when importing.) This should turn Ã« back into ë. If the data is consistently broken (i.e. it's not a mix of valid UTF-8 characters and broken ones), this may work.
Obviously, make backups before trying this, and go through the data with a fine-tooth comb afterwards.
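If going through phpMyAdmin is not practical, the same kind of repair can be sketched in PHP. This is only a sketch under stated assumptions: it assumes each broken value is valid UTF-8 text that was misread as Windows-1252 (which MySQL's latin1 closely follows) and re-encoded one or more times, and it needs the mbstring extension. The function names undoOneLayer and repairMojibake are made up for this example. Strings containing characters whose original bytes have no Windows-1252 assignment (0x81, 0x8D, 0x8F, 0x90, 0x9D) will probably be left untouched rather than repaired.

```php
<?php
/**
 * Hypothetical helper: undo one layer of "UTF-8 bytes misread as
 * Windows-1252 and re-encoded as UTF-8". Returns null when the string
 * does not look like it carries such a layer.
 */
function undoOneLayer(string $s): ?string
{
    // Map every character back to the single byte it came from.
    $bytes = mb_convert_encoding($s, 'Windows-1252', 'UTF-8');

    // If the reverse mapping does not reproduce the input, something was
    // lost (a character outside Windows-1252), so do not touch the string.
    if (mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252') !== $s) {
        return null;
    }

    // Pure ASCII input is unchanged by the conversion: nothing to undo.
    if ($bytes === $s) {
        return null;
    }

    // Only accept the result if the recovered bytes are valid UTF-8.
    return mb_check_encoding($bytes, 'UTF-8') ? $bytes : null;
}

/**
 * Hypothetical helper: peel off layers until the string stops looking
 * double-encoded. $maxLayers is an arbitrary safety limit.
 */
function repairMojibake(string $s, int $maxLayers = 3): string
{
    for ($i = 0; $i < $maxLayers; $i++) {
        $next = undoOneLayer($s);
        if ($next === null) {
            break;
        }
        $s = $next;
    }
    return $s;
}
```

A correctly stored value such as "Tiësto" is returned unchanged, because its Windows-1252 bytes are not valid UTF-8. The heuristic can still misfire on text that legitimately contains sequences like "Ã«", so it is worth running it against a copy of the table first and reviewing the differences.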
A rather unorthodox solution that I've found, and that seems to be working after testing:
Connection A = UTF-8 connection
Connection B = old non-UTF-8 connection, which was used to insert the original data
- With A, fetch "name"; the value displays correctly over B but is corrupted over A, because it was originally written through the non-UTF-8 connection
- Find the item ID over B by looking up the corrupted value from A
- Then, using A, update the row with the correctly encoded UTF-8 value
It's rather convoluted, but it seems to be working. I will update this if there are any problems.
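A simplified variant of this idea could look like the sketch below. It is not the exact procedure described above: instead of looking rows up by their corrupted value, it reads id and name over the old connection and writes the value back over the UTF-8 connection by id. The function name rewriteNames is made up for this example, the table and column names are taken from the dump in the question, and the mbstring extension is assumed.

```php
<?php
/**
 * Hypothetical helper: copy "name" values as read over the legacy
 * (non-UTF-8) connection back through the UTF-8 connection, matching
 * rows by primary key. Returns the number of rows rewritten.
 */
function rewriteNames(PDO $legacy, PDO $utf8): int
{
    // Read everything first so the updates do not interfere with the read.
    $rows = $legacy->query('SELECT id, name FROM data')->fetchAll(PDO::FETCH_ASSOC);
    $update = $utf8->prepare('UPDATE data SET name = :name WHERE id = :id');

    $fixed = 0;
    foreach ($rows as $row) {
        $name = $row['name'];

        // Pure ASCII values need no repair.
        if (!preg_match('/[^\x00-\x7F]/', $name)) {
            continue;
        }
        // If the bytes read over the legacy connection are not valid UTF-8,
        // the row is either stored correctly already or broken in some
        // other way, so leave it alone.
        if (!mb_check_encoding($name, 'UTF-8')) {
            continue;
        }

        $update->execute([':name' => $name, ':id' => $row['id']]);
        $fixed++;
    }
    return $fixed;
}

// Assumed usage (placeholders, not real credentials):
// $legacy = new PDO($dsn, $user, $pass, [PDO::MYSQL_ATTR_INIT_COMMAND => 'SET NAMES latin1']);
// $utf8   = new PDO($dsn, $user, $pass, [PDO::MYSQL_ATTR_INIT_COMMAND => 'SET NAMES utf8']);
// echo rewriteNames($legacy, $utf8) . " rows rewritten\n";
```

Each pass removes at most one layer of bad encoding, so rows that were mangled more than once (like the first row in the question) would need another pass. As with any bulk update, this should be tried on a backup first.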