PHP's SimpleXML not handling &#8217; properly
I'm parsing an RSS feed that has an &#8217; in it. SimpleXML turns this into â€™. What can I do to stop this?
Just to answer some of the questions that have come up: I'm pulling an RSS feed using cURL. If I output this directly to the browser, the &#8217; displays as ’, which is what's expected. When I create a new SimpleXMLElement from it (e.g. $xml = new SimpleXMLElement($raw_feed);) and dump the $xml variable, every instance of &#8217; is replaced with â€™.
It appears that SimpleXML is having trouble with ampersand-encoded characters (numeric character references) in a UTF-8 document. (The XML declaration specifies UTF-8.)
I do have control over the feed contents after cURL has retrieved them and before they are used to construct the SimpleXML element.
&#8217; represents the Unicode character ’ (U+2019), which is encoded as 0xE28099 in UTF-8. When that byte sequence is interpreted as Windows-1252, it represents the characters â (0xE2), € (0x80), and ™ (0x99).
That means SimpleXML handles the input as UTF-8, but its output is being interpreted as Windows-1252. Unless you really want to use Windows-1252, you have probably just forgotten to declare the character encoding of your output.
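To make the byte-level explanation concrete, here is a minimal sketch that reproduces the effect without SimpleXML. It assumes PHP 7 or later with the mbstring extension enabled; the variable names are only illustrative.

```php
<?php
// Decode the numeric character reference into the actual character,
// encoded as UTF-8.
$char = html_entity_decode('&#8217;', ENT_QUOTES | ENT_HTML5, 'UTF-8');

// U+2019 occupies three bytes in UTF-8: E2 80 99.
$hex = bin2hex($char);

// Simulate the misinterpretation: treat those three bytes as if they were
// Windows-1252 text and re-encode the result as UTF-8 for display.
$mojibake = mb_convert_encoding($char, 'UTF-8', 'Windows-1252');

echo $hex, "\n";      // e28099
echo $mojibake, "\n"; // the three characters â, € and ™
```

The parser's output is correct UTF-8 in this sketch; the three-character sequence only appears once something downstream reads those bytes as Windows-1252.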
It came down to having to set the default encoding to UTF-8 in four places:
- The default locale at the head of the file:
setlocale(LC_ALL, 'en_US.UTF8');
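- Encoding the string that comes out of cURL (utf8_encode() assumes ISO-8859-1 input and is deprecated as of PHP 8.2, so apply it only if the feed is not already UTF-8):
utf8_encode($string);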
- Setting the MySQL connection to use UTF-8 by default:
mysqli_set_charset($database_insert_connection, 'utf8');
- Setting the appropriate collation in the MySQL database to
utf8_general_ci
If outputting to the browser, also set the appropriate header (e.g. header('Content-Type: text/html; charset=utf-8');).
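As a rough sketch of how the feed-handling part of these steps might fit together, the snippet below converts the feed only when it is not already valid UTF-8, declares the output encoding, and then parses the feed. The helper ensure_utf8() and the sample $raw_feed string are hypothetical and not from the original code. The sketch assumes the mbstring and SimpleXML extensions are available and that any non-UTF-8 input is ISO-8859-1.

```php
<?php
// Hypothetical helper (not part of the original post): return $raw unchanged
// when it is already valid UTF-8, otherwise convert it from $fallback.
function ensure_utf8(string $raw, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already UTF-8; converting again would double-encode it
    }
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

// Declare the output encoding before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Stand-in for the string returned by cURL (assumed sample data).
$raw_feed = '<?xml version="1.0" encoding="UTF-8"?><title>It&#8217;s here</title>';

$xml = new SimpleXMLElement(ensure_utf8($raw_feed));
$title = (string) $xml; // UTF-8 text containing U+2019

echo htmlspecialchars($title, ENT_QUOTES, 'UTF-8'), "\n";
```

Checking validity first is a design choice: it avoids double-encoding a feed that is already UTF-8, which an unconditional conversion would do.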
Hope this helps someone in the future!