PHP, MySQL and XML = garbled HTML output
I have a field in MySQL of type text, using the following collation: utf8_general_ci
.
This XML field is populated using a variable built using DOMDocument:
function ed_audit_node($dom, $field, $new, $old){
//create audit_detail node
$ad = $dom->createElement('audit_detail');
$fn = $dom->createElement('fieldname');
$fn->appendChild($dom->createTextNode($field));
$ad-开发者_如何学Python>appendChild($fn);
$ov = $dom->createElement('old_value');
$ov->appendChild($dom->createTextNode($old));
$ad->appendChild($ov);
$nv = $dom->createElement('new_value');
$nv->appendChild($dom->createTextNode($new));
$ad->appendChild($nv);
//append to document
return $ad;
}
Here's how I save to the db ( $xml comes from $dom->saveXML() ):
function ed_audit_insert($ed, $xml){
global $visitor;
$sql = <<<EOF
INSERT INTO ed.audit
(employee_id, audit_date, audit_action, audit_data, user_id)
VALUES (
{$ed[emp][employee_id]},
now(),
'{$ed[audit_action]}',
'{$xml}',
{$visitor[user_id]}
);
EOF;
$req = mysql_query($sql,$ed['db']) or die(db_query_error($sql,mysql_error(),__FUNCTION__));
//snip
}
See an older, parallel, slightly related thread on how I’m creating this XML: Another PHP XML parsing error: "Input is not proper UTF-8, indicate encoding!"
What works: - querying the database, selecting the field and outputting it using jQuery (.ajax()) and populating a textarea. Firebug and the textarea match what's in the database (confirmed with Toad).
What doesn't work: - outputting the text from the database into an HTML page. This HTML page has the content-type ISO-8859-1, which I cannot change.
Here’s the code that outputs that to the screen:
$xmlData = simplexml_load_string($d['audit_data']);
foreach ($xmlData->audit_detail as $a){
echo "<p> straight from db = ".$a->new_value."</p>";
echo "<p> utf8_decode() = ".utf8_decode($a->new_value)."</p>";
}
I’ve also used a charset changer extension for Firefox: tried ISO-8859-1, UTF-8 and 1252 without success.
If it was UTF-8, shouldn’t I be seeing diamonds with question marks inside (since it's content-type = ISO-8859-1)? If it’s not UTF-8, what is it?
Edit #1
Here's snapshot of other tests that I have made:
$xmlData = simplexml_load_string($d['audit_data']);
foreach ($xmlData->audit_detail as $a){
echo "<p>encoding is, straight from db, using mb_detect_encoding: ".mb_detect_encoding($a->new_value)."</p>";
echo "<p>encoding is, with utf8_decode, using mb_detect_encoding: ".mb_detect_encoding(utf8_decode($a->new_value))."</p>";
echo "<hr/>";
echo "<p> straight from db = <pre>".$a->new_value."</pre></p>";
echo "<p> utf8_decode() = <pre>".utf8_decode($a->new_value)."</pre></p>";
echo "<hr/>";
$iso88591_2 = iconv('UTF-8', 'ISO-8859-1', $a->new_value);
$iso88591_3 = mb_convert_encoding($a->new_value, 'ISO-8859-1', 'UTF-8');
echo "<p> iconv() = ".$iso88591_2."</p>";
echo "<p> mb_convert_encoding() = ".$iso88591_3."</p>";
}
Edit #2
I added the FF proprietary tag, xmp.
Code:
$xmlData = simplexml_load_string($d['audit_data']);
foreach ($xmlData->audit_detail as $a){
echo "<p>encoding is, straight from db, using mb_detect_encoding: ".mb_detect_encoding($a->new_value)."</p>";
echo "<p>encoding is, with utf8_decode, using mb_detect_encoding: ".mb_detect_encoding(utf8_decode($a->new_value))."</p>";
echo "<hr/>";
echo "<p> straight from db = <pre>".$a->new_value."</pre></p>";
echo "<p> utf8_decode() = <pre>".utf8_decode($a->new_value)."</pre></p>";
echo "<hr/>";
$iso88591_2 = iconv('UTF-8', 'ISO-8859-1', $a->new_value);
$iso88591_3 = mb_convert_encoding($a->new_value, 'ISO-8859-1', 'UTF-8');
echo "<p> iconv() = ".$iso88591_2."</p>";
echo "<p> mb_convert_encoding() = ".$iso88591_3."</p>";
echo "<hr/>";
echo "<p>straight from db, using <xmp> = <xmp>".$a->new_value."</xmp></p>";
echo "<p>utf8_decode(), using <xmp> = <xmp>".utf8_decode($a->new_value)."</xmp></p>";
}
Here are some meta tags from the page:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta name="dc.language" scheme="ISO639-2/T" content="eng" />
IMO, the last meta tag has no bearing.
Edit #3
Source code:
<p>encoding is, straight from db, using mb_detect_encoding: UTF-8</p><p>encoding is, with utf8_decode, using mb_detect_encoding: ASCII</p><hr/><p> straight from db = <pre>Ro马eç ³é ¥n franê¡©s</pre></p><p> utf8_decode() = <pre>Ro?e??n fran?s</pre></p><hr/><p> iconv() = Ro</p><p> mb_convert_encoding() = Ro?e??n fran?s</p><hr/><p>straight from db, using <xmp> = <xmp>Ro马eç ³é ¥n franê¡©s</xmp></p><p>utf8_decode(), using <xmp> = <xmp>Ro?e??n fran?s</xmp></p>
Edit #4
Here is the SQL statement going in to the db:
INSERT INTO ed.audit
(employee_id, audit_date, audit_action, audit_data, user_id)
VALUES (
75,
now(),
'u',
'<?xml version="1.0"?>
<audit><audit_detail><fieldname>role_fra</fieldname><old_value>aRo马e砳頥n franꡩs</old_value><new_value>bRo马e砳頥n franꡩs</new_value></audit_detail></audit>
',
333
);
! Note, the text from this XML doesn't necessarily match the screenshots provided above.
Edit #5
Here's my new function that wraps the CDATA tag around my values for the old_value and new_value nodes:
function ed_audit_node($dom, $field, $new, $old){
//create audit_detail node
$ad = $dom->createElement('audit_detail');
$fn = $dom->createElement('fieldname');
$fn->appendChild($dom->createTextNode($field));
$ad->appendChild($fn);
$ov = $dom->createElement('old_value');
$ov->appendChild($dom->createCDATASection($old));
$ad->appendChild($ov);
$nv = $dom->createElement('new_value');
$nv->appendChild($dom->createCDATASection($new));
$ad->appendChild($nv);
//append to document
return $ad;
}
I also added the encoding to the XML document:
$dom = new DomDocument('1.0', 'UTF-8');
Here's my new simpleXML call:
$xmlData = simplexml_load_string($d['audit_data'], "SimpleXMLElement", LIBXML_NOENT | LIBXML_NOCDATA);
I see the CDATA tags in Toad as well. However, I'm still getting an error:
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 2: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xE9 0xE9 0x6C 0x65 in <snip>
Edit #6
I just noticed that the jQuery call returns the proper accented characters in the CDATA.
Technically your string is in UTF8, but the HTML encoded characters (when rendered by the browser) are not in UTF8. So ꡩ
is a valid UTF8 String, but the character that is rendered onto the screen from the web browser is not valid UTF8.
I would also wrap your echo to the screen (last 2 lines in your example) like this:
echo "<p>straight from db = <xmp>".$a->new_value."</xmp></p>";
echo "<p>utf8_decode() = <xmp>".utf8_decode($a->new_value)."</xmp></p>";
This will clearly display the point i am making above.
Edit:
The problem actually is an uncontrollable undocumented "Feature" in simplexml_load_string() for PHP. It will automatically convert all characters from their XML entity form strait into their actual char form. The only way to circumvent this is to use simplexml_load_string() like this:
$data = simplexml_load_string(
'<?xml version="1.0" encoding="utf-8"?>
<audit>
<audit_detail>
<fieldname>role_fra</fieldname>
<old_value><![CDATA[aRo马e砳頥n franꡩs]]></old_value>
<new_value><![CDATA[bRo马e砳頥n franꡩs]]></new_value>
</audit_detail>
</audit>',
"SimpleXMLElement",
LIBXML_NOENT | LIBXML_NOCDATA
);
print "<PRE>";
print_r($data);
exit;
You must wrap your elements in <![CDATA[]]>
tags and then pass the LIBXML_NOCDATA option to the xml parser. This will force the things in <![CDATA[]]>
tags to be cast as String type and PHP can properly handle that outside of a SimpleXMLObject.
精彩评论