PHP UTF-8 encoding problems
Hi all, I have run into a tricky problem. I need to read some files and convert their contents into XML files. Most lines in these files are plain ASCII, so I could just read each line in PHP and write it to an XML file declared with the default XML encoding, UTF-8. However, I noticed that some lines contain GBK or GB2312 (Chinese) or SJIS (Japanese) characters. PHP has no problem writing those strings into the XML as they are, but the XML parser then detects invalid UTF-8 byte sequences and crashes.
Now, I think the best PHP library function for my purpose is probably:
$decode_str = mb_convert_encoding($str, 'UTF-8', 'auto');
I tried running this conversion function on each line before inserting it into the XML. However, when I tested it with some UTF-16 and GBK encoded input, the function did not seem able to correctly identify the encoding of the input string.
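One way to narrow the guessing is to pass an explicit candidate list to mb_detect_encoding in strict mode instead of relying on 'auto'. The following is only a sketch: the helper name to_utf8_guess and the candidate list are assumptions, not something from the question, and because the SJIS and GBK byte ranges overlap, a short line can still be misidentified.

```php
<?php
// Hypothetical helper: the candidate list is an assumption, adjust it to your data.
function to_utf8_guess(string $line, array $candidates = ['UTF-8', 'SJIS', 'CP936']): ?string
{
    // Strict mode rejects a candidate when the bytes are not valid in that encoding.
    $from = mb_detect_encoding($line, $candidates, true);
    if ($from === false) {
        return null; // no candidate fits; the caller decides what to do with the line
    }
    return mb_convert_encoding($line, 'UTF-8', $from);
}

$converted = to_utf8_guess('plain ascii line');
```

Returning null rather than a best guess lets the caller log or skip lines that match none of the expected encodings. UTF-16 input is not in the candidate list above; it is usually easier to recognise from a byte order mark at the start of the file than line by line.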
In addition, I tried wrapping the string in CDATA, but strangely the XML parser still complains about invalid UTF-8 codes. And of course, when I open the XML file in vim, the content inside the CDATA is a mess.
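CDATA only switches off markup parsing. The parser still has to decode the bytes as UTF-8 first, so invalid sequences fail either way. A minimal sketch of cleaning a string before it goes into the XML, assuming PHP 7.2 or later for mb_scrub; the helper name xml_safe_text is hypothetical:

```php
<?php
// Hypothetical helper, not from the original post.
function xml_safe_text(string $s): string
{
    // Replace byte sequences that are not valid UTF-8 with a substitute character.
    $s = mb_scrub($s, 'UTF-8');
    // Drop characters that XML 1.0 does not allow at all, such as most control codes.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $s
    );
    return $clean ?? '';
}

// Stray GBK bytes and a control character are neutralised before escaping.
$text = xml_safe_text("ok \xB0\xA1 \x01 text");
$element = '<line>' . htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8') . '</line>';
```

This makes the file parseable but does not recover the original characters. Lines in another encoding still need to be converted to UTF-8 first if their content matters.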
Any suggestions?
I once spent a lot of time creating a safe UTF-8 encoding function:
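function _convert($content) {
    // Treat the input as non-UTF-8 if the check fails or if a round trip
    // through UTF-32 does not give back the same bytes.
    if (!mb_check_encoding($content, 'UTF-8')
        OR !($content === mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32'))) {
        // No source encoding is given, so mbstring assumes its internal encoding.
        $content = mb_convert_encoding($content, 'UTF-8');
        if (mb_check_encoding($content, 'UTF-8')) {
            // log('Converted to UTF-8');
        } else {
            // log('Could not be converted to UTF-8');
        }
    }
    return $content;
}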
The main problem was to figure out which encoding the input string is already using. Please tell me if my solution works for you as well!
I ran into this problem while using json_encode. I use this to get everything into utf8. Source: http://us2.php.net/manual/en/function.json-encode.php
function ascii_to_entities($str)
{
    $count = 1;
    $out = '';
    $temp = array();
    for ($i = 0, $s = strlen($str); $i < $s; $i++)
    {
        $ordinal = ord($str[$i]);
        if ($ordinal < 128)
        {
            if (count($temp) == 1)
            {
                $out .= '&#'.array_shift($temp).';';
                $count = 1;
            }
            $out .= $str[$i];
        }
        else
        {
            if (count($temp) == 0)
            {
                $count = ($ordinal < 224) ? 2 : 3;
            }
            $temp[] = $ordinal;
            if (count($temp) == $count)
            {
                $number = ($count == 3)
                    ? (($temp['0'] % 16) * 4096) + (($temp['1'] % 64) * 64) + ($temp['2'] % 64)
                    : (($temp['0'] % 32) * 64) + ($temp['1'] % 64);
                $out .= '&#'.$number.';';
                $count = 1;
                $temp = array();
            }
        }
    }
    return $out;
}
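For comparison, mbstring has a built-in that does a similar conversion. The function above only decodes 2-byte and 3-byte sequences, while mb_encode_numericentity also covers 4-byte ones. This is a sketch assuming the input is already valid UTF-8; the wrapper name utf8_to_entities is hypothetical:

```php
<?php
// Hypothetical wrapper name; mb_encode_numericentity itself is part of mbstring.
function utf8_to_entities(string $str): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    // Everything from U+0080 upward becomes a decimal numeric entity.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($str, $convmap, 'UTF-8');
}

echo utf8_to_entities("caf\xC3\xA9"), "\n"; // prints caf&#233;
```

ASCII characters pass through unchanged, so the result is pure ASCII and can be written into an XML or JSON payload regardless of the declared encoding.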