开发者

PHP SimpleXML Values returned have weird characters in place of hyphens and apostrophes

I have looked around and can't seem to find a solution so here it is.

I have the following code:

$file = "adhddrugs.xml";
$xmlstr = simple开发者_开发技巧xml_load_file($file);
echo $xmlstr->report_description;

This is the simple version, but even trying this any hyphens r apostrophes are turned into: ^a (euro sign) trademark sign.

Things I have tried are:

echo = (string)$xmlstr->report_description; /* did not work */
echo = addslashes($xmlstr->report_description); /* yes I know this doesnt work with hyphens, was mainly trying to see if I could escape the apostrophes */
echo = addslashes((string)$xmlstr->report_description); /* did not work */

also htmlspecial(again i know does not work with hyphens), htmlentities, and a few other tricks.

Now the situation is I am getting the XML files from a feed so I cannot change them, but they are pretty standard. The text with the hyphens etc are encapsulated in a cdata tag and encoding is UTF-8. If I check the source I am shown the hyphens and apostrophes in the source.

Now just to see if the encoding was off or mislabeled or something else weird, I tried to view the raw XML file and sure enough it is displayed correctly.

I am sure that in my rush to find the answer I have overlooked something simple and the fact that this is really the first time I have ever used SimpleXML I am missing a very simple solution. Just don't dock me for it I really did try and find the answer on my own.

Thanks again.


This is the simple version, but even trying this any hyphens apostrophes are turned into: ^a (euro sign) trademark sign.

This is caused by incorrect charset guessing (and possibly recoding).

If a text contains a "curly apostrophe" = "Right single quotation mark" = U+2019 character, saving it in UTF-8 encoding results in bytes 0xE2 0x80 0x99. If the same file is then read again assuming its charset is windows-1252, the byte stream of the apostrophe character (0xE2 0x80 0x99) is interpreted as characters ’ (=small "a" with circumflex, euro sign, trademark sign). Again if this incorrectly interpreted text is saved as UTF-8 the original character results in byte stream 0xC3 0xA2 0xE2 0x82 0xAC 0xE2 0x84 0xA2

Summary: Your original data is UTF-8 and some part of your code that reads the data assumes it is windows-1252 (or ISO-8859-1, which is usually actually treated as windows-1252). A probable reason for this charset assumption is that default charset for HTTP is ISO-8859-1. 'When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.' Source: RFC 2616, Hypertext Transfer Protocol -- HTTP/1.1

PS. this is a very common problem. Just do a Google or Bing search with query doesn’t -doesn't and you'll see many pages with this same encoding error.


Do you know the document's character set?

You could do header('Content-Type: text/html; charset=utf-8'); before any content is printed, if you havent already.


Make sure you have set up SimpleXML to use UTF-8 too.

Be sure that all the entities are encoded using hex notation, not HTML entities.

Also maybe:

$string = html_entity_decode($string, ENT_QUOTES, "utf-8");

will help.


This is a symptom of declaring an incorrect character set in the <head> section of your page (or not declaring and using default character set without accents and special characters).

This does the trick for latin languages.

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

For TOTAL NEWBIES, html pages for browsers have a basic layout, with a HEAD or HEADER which serves to tell the browser some basic stuff about the page, as well as preload some scripts that the page will use to achieve its functionality(ies).

<html>
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
 </head>
 <body>
  Hello world
 </body>
</html>

if the <head> section is omitted, html will use defaults (take some things for granted - like using the northamerican character set, which does NOT include many accented letters, whch show up as "weird characters".

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜