
PHP's SimpleXML not handling &#8217; properly

I'm parsing an RSS feed that has an &#8217; in it. SimpleXML turns this into â€™. What can I do to stop this?

Just to answer some of the questions that have come up: I'm pulling an RSS feed using cURL. If I output this directly to the browser, the &#8217; displays as ’, which is what's expected. When I create a new SimpleXMLElement from it (e.g. $xml = new SimpleXMLElement($raw_feed);) and dump the $xml variable, every instance of &#8217; is replaced with â€™.

It appears that SimpleXML is having trouble with UTF-8 ampersand encoded characters. (The XML declaration specifies UTF-8.)

I do have control over the feed contents after cURL has retrieved them and before they are used to construct the SimpleXML element.


&#8217; represents the Unicode character ’ (U+2019), which is encoded as the byte sequence 0xE2 0x80 0x99 in UTF-8. When that byte sequence is interpreted as Windows-1252, it represents the characters â (0xE2), € (0x80), and ™ (0x99).
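The byte-level explanation can be checked with a few lines of PHP. This is only a minimal sketch, assuming the mbstring extension is available; the variable names are made up for illustration.

```php
<?php
// U+2019 (right single quotation mark) written as its UTF-8 byte sequence.
$quote = "\xE2\x80\x99";

// The three bytes UTF-8 uses for this character.
$hex = bin2hex($quote); // "e28099"

// Simulate a viewer that wrongly reads those bytes as Windows-1252:
// each byte is treated as one Windows-1252 character and re-encoded as UTF-8.
$mojibake = mb_convert_encoding($quote, 'UTF-8', 'Windows-1252');

echo $hex, "\n";      // e28099
echo $mojibake, "\n"; // â€™
```

The second line of output is the familiar three-character garbage, which suggests the data itself is fine and only the interpretation of the bytes is wrong.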

That means SimpleXML handles the input as UTF-8, but its output is being interpreted as Windows-1252. Unless you really want to use Windows-1252, you have probably just not declared the character encoding of your output.
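As a rough illustration of that point, the sketch below parses a small feed and confirms that SimpleXML hands back UTF-8 bytes. The feed string is a hypothetical stand-in for whatever cURL returned, not the real feed.

```php
<?php
// Hypothetical feed fragment; in the question this string comes from cURL.
$raw_feed = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<rss><channel><title>It&#8217;s here</title></channel></rss>';

$xml = new SimpleXMLElement($raw_feed);
$title = (string) $xml->channel->title;

// SimpleXML has decoded the entity into the UTF-8 bytes for U+2019.
$isUtf8 = ($title === "It\xE2\x80\x99s here");

// Declare the output encoding before sending anything to the browser.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}
echo htmlspecialchars($title, ENT_QUOTES, 'UTF-8'), "\n";
```

If `$isUtf8` is true but the page still shows â€™, the problem most likely sits in how the output is declared or viewed rather than in SimpleXML.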


It came down to having to set the default encoding to UTF-8 in four places:

  1. The default locale at the head of the file: setlocale(LC_ALL, 'en_US.UTF8');
  2. Encoding the string that comes out of CURL: utf8_encode($string);
  3. Setting the MySQL connection to use UTF-8 by default: mysqli_set_charset($database_insert_connection, 'utf8');
  4. Setting the appropriate collation in the MySQL database to utf8_general_ci

If outputting to the browser, also set the appropriate header (e.g. header('Content-Type: text/html; charset=utf-8');).
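One caveat on step 2: utf8_encode() assumes its input is ISO-8859-1, so applying it to a string that is already UTF-8 would double-encode it. A more defensive variant, sketched here as an alternative rather than as what the answer actually used, converts only when the string is not already valid UTF-8. The helper name to_utf8 and the assumed source encoding are hypothetical, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: return $string as UTF-8, converting only when needed.
// $assumedSource is a guess at the original encoding, not something detected.
function to_utf8(string $string, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($string, 'UTF-8', $assumedSource);
}

$already = to_utf8("It\xE2\x80\x99s"); // valid UTF-8 is returned unchanged
$latin1  = to_utf8("caf\xE9");         // ISO-8859-1 "café" becomes UTF-8
```

Checking validity first keeps a correctly encoded feed intact while still repairing feeds that arrive in a single-byte encoding.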

Hope this helps someone in the future!
