PHP SimpleXML Values returned have weird characters in place of hyphens and apostrophes

2023-02-07 14:13 Q&A author:

I have looked around and can't seem to find a solution so here it is.

I have the following code:

$file = "adhddrugs.xml";
$xmlstr = simplexml_load_file($file);
echo $xmlstr->report_description;

This is the simple version, but even with this, any hyphens or apostrophes are turned into: â€™ (a with circumflex, euro sign, trademark sign).

Things I have tried are:

echo (string)$xmlstr->report_description; /* did not work */
echo addslashes($xmlstr->report_description); /* yes I know this doesn't work with hyphens, was mainly trying to see if I could escape the apostrophes */
echo addslashes((string)$xmlstr->report_description); /* did not work */

I also tried htmlspecialchars (again, I know it does not work with hyphens), htmlentities, and a few other tricks.

Now the situation is that I am getting the XML files from a feed, so I cannot change them, but they are pretty standard. The text with the hyphens etc. is wrapped in a CDATA section and the encoding is UTF-8. If I check the source, the hyphens and apostrophes are shown correctly.

Now just to see if the encoding was off or mislabeled or something else weird, I tried to view the raw XML file and sure enough it is displayed correctly.

I am sure that in my rush to find the answer I have overlooked something simple, and since this is really the first time I have ever used SimpleXML, I am probably missing a very simple solution. Just don't dock me for it; I really did try to find the answer on my own.

Thanks again.

"This is the simple version, but even with this, any hyphens or apostrophes are turned into: â€™ (a with circumflex, euro sign, trademark sign)."

This is caused by incorrect charset guessing (and possibly recoding).

If a text contains a "curly apostrophe" = "Right single quotation mark" = U+2019 character, saving it in UTF-8 encoding results in bytes 0xE2 0x80 0x99. If the same file is then read again assuming its charset is windows-1252, the byte stream of the apostrophe character (0xE2 0x80 0x99) is interpreted as the three characters â€™ (small "a" with circumflex, euro sign, trademark sign). If this incorrectly interpreted text is then saved as UTF-8 again, the original single character ends up as the byte stream 0xC3 0xA2 0xE2 0x82 0xAC 0xE2 0x84 0xA2.
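The byte sequences described above can be reproduced with a short sketch. It assumes the mbstring extension is available, and the variable names are only illustrative:

```php
<?php
// U+2019 (right single quotation mark) as UTF-8 bytes.
$apostrophe = "\xE2\x80\x99";
echo bin2hex($apostrophe), "\n"; // e28099

// Misread those three bytes as windows-1252 and re-encode the result as UTF-8.
// This is the "double encoding" that produces the three visible characters.
$mojibake = mb_convert_encoding($apostrophe, 'UTF-8', 'Windows-1252');
echo bin2hex($mojibake), "\n"; // c3a2e282ace284a2

// Reversing the wrong step recovers the original character.
$repaired = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');
var_dump($repaired === $apostrophe); // bool(true)
```

If the first hex dump of your own data already shows e28099, the XML is being read correctly and the misinterpretation is happening later, most likely when the browser displays the page.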

Summary: Your original data is UTF-8 and some part of your code that reads the data assumes it is windows-1252 (or ISO-8859-1, which is usually actually treated as windows-1252). A probable reason for this charset assumption is that default charset for HTTP is ISO-8859-1. 'When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.' Source: RFC 2616, Hypertext Transfer Protocol -- HTTP/1.1

PS. this is a very common problem. Just do a Google or Bing search with query doesnâ€™t -doesn't and you'll see many pages with this same encoding error.

Do you know the document's character set?

You could do header('Content-Type: text/html; charset=utf-8'); before any content is printed, if you haven't already.
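A minimal sketch of how that header might fit into the code from the question. The inline XML string is a hypothetical stand-in for the feed file (the real code would keep using simplexml_load_file), and the element names other than report_description are assumptions:

```php
<?php
// Tell the browser that the bytes that follow are UTF-8.
// This has to happen before any output is printed.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Hypothetical sample data standing in for adhddrugs.xml.
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
     . "<report><report_description><![CDATA[Doesn\xE2\x80\x99t work \xE2\x80\x93 yet]]>"
     . "</report_description></report>";

$report = simplexml_load_string($xml);
if ($report === false) {
    exit("Could not parse XML\n");
}

// SimpleXML hands back UTF-8 strings, so no recoding is needed here.
$description = (string) $report->report_description;

// Escape for HTML output while declaring the string's encoding explicitly.
echo htmlspecialchars($description, ENT_QUOTES, 'UTF-8'), "\n";
```

With the header in place the curly apostrophe and the dash should reach the browser as the same UTF-8 bytes that were in the feed.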

Make sure you have set up SimpleXML to use UTF-8 too.

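Be sure that all the entities are encoded as numeric character references (for example &#x2019;), not named HTML entities (for example &rsquo;), since XML itself only predefines &amp;, &lt;, &gt;, &quot; and &apos;.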

Also maybe:

$string = html_entity_decode($string, ENT_QUOTES, "utf-8");

will help.
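As a rough illustration of what that call does, here is a small sketch. The input string is made up, and this only matters if the feed text really contains HTML entities rather than raw UTF-8 characters:

```php
<?php
// Made-up input containing a named entity and a numeric character reference.
$string = 'Doesn&rsquo;t work &#x2013; yet';

// Decode entities into real characters, producing UTF-8 output.
$decoded = html_entity_decode($string, ENT_QUOTES, 'UTF-8');

echo bin2hex(substr($decoded, 5, 3)), "\n"; // e28099, the UTF-8 bytes of U+2019
```

The decoded string still has to be sent with a UTF-8 charset declaration, otherwise the same weird characters come back.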

This is a symptom of declaring an incorrect character set in the <head> section of your page (or of not declaring one at all, so that the browser falls back to a default character set without accents and special characters).

This does the trick for Latin languages.

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

For TOTAL NEWBIES: HTML pages have a basic layout, with a HEAD section that tells the browser some basic things about the page and preloads scripts that the page will use for its functionality.

<html>
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
 </head>
 <body>
  Hello world
 </body>
</html>

If the <head> section is omitted, the browser will use defaults (that is, take some things for granted, like using the North American character set, which does NOT include many accented letters, which then show up as "weird characters").
