Corrupt UTF-8 Characters with PHP 5.2.10 and MySQL 5.0.81
We have an application hosted on both a local development server and a live site. We are experiencing UTF-8 corruption issues and are looking to figure out how to resolve them.
The system is run using symfony 1.0 with Propel.
On our development server, we are running PHP 5.2.0 and MySQL 5.0.32. We do not experience corrupted UTF-8 characters there.
On our live site, PHP 5.2.10 and MySQL 5.0.81 is running. On that server, certain characters such as ố and Σ are corrupted once they are stored in the database. The corrupted characters are showing up as either question marks or approximations of the original character with adjacent question marks.
Examples of corruption:
Uncorrupted: ố Corrupted: ô?
Uncorrupted: Σ Corrupted: ?
We are currently using the following techniques on both development and live servers:
Executing the following queries prior to execution of any other queries:
SET NAMES 'utf8' COLLATE 'utf8_unicode_ci' SET CHARSET 'utf8'
Setting the
<meta>
Content-Type value to:<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Adding the following to our .htaccess file:
AddDefaultCharset utf-8
Using
mb_*
(multibyte) PHP functions where necessary.Being sure to set database columns to use utf8_unicode_ci collation.
These tech开发者_如何学编程niques are sufficient for our development site, but do not work on the live site.
On the live site I've also tried adding mysql_set_encoding('ut8', $mysql_connection)
but this does not help either. I have found some evidence that newer versions of PHP and MySQL are mishandling UTF-8 character encodings.
Take a minimal example - submit a form with just say a Σ character, and isolate where in the path from browser to database it's being corrupted. Print the value of bin2hex($str) to a log in your PHP app, as soon as possible after the request is received from the browser, the last thing before you hand it off to the database, and anywhere else you suspect might be a problem area - for Σ it should print out cea3. In the database, run charset(col), hex(col) on the saved data - if it's all working it should print out ["utf8", "cea3"] (assuming that the tables are utf8).
The clearer you can be about how the characters are corrupted - do they show up as blank? as mojibake? as empty "tofu" boxes? is the string truncated at the problem character? what are the expected and observed byte representations? - the more clues you'll have about what might have caused it.
Note that if you are using Doctrine or Propel mysql_set_encoding()
is going to have no effect as both ORMs are PDO based (Propel < 1.3 is Creole/Mysqli based).
Our experienced system administrator discovered a fix:
alter database DB_NAME character set utf8;
This solved our problems completely.
Are the chars corrupted only once they are stored in the DB? What about PHP?
Try adding the accept-charset="UTF-8"
attribute to your form elements.
Also, make sure your .php
files are UTF-8 encoded without BOM.
I had the same problem, but it wasn't MySQL since Doctrine is taken care of everything, from DB collation to tables and connection in UTF-8.
I found substr()
to be causing troubles with umlauts(ü) so I switched to mb_substr()
, mb_strlen()
etc. for user input data. Do not forget about the ini setting or it won't work:
ini_set('mbstring.internal_encoding','UTF-8');
OR
mbstring.internal_encoding = UTF-8; //in php.ini
精彩评论