Turning HTML character entities to 'regular' letters... why is it only partially working?

2022-12-21 11:26

I'm using all of the below to take a field called 'code' from my database, get rid of all the HTML entities, and print it 'as usual' to the site:

   <?php $code = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $code);
   $code = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $code); 
   $code = html_entity_decode($code); ?>

However the exported code still looks like this:

progid:DXImageTransform.Microsoft.AlphaImageLoader(src=â€™img/the_image.pngâ€™);

See what's going on there? How many other things can I run on the string to turn them into darn regular characters?!

Thanks!

Jack

â€™ is what you get when you read the UTF-8 encoded character ’ (RIGHT SINGLE QUOTATION MARK, U+2019) as if it were encoded as windows-1252. In other words, you have two problems: you're using the wrong encoding to read the wrong character.

HTML attribute values are supposed to be enclosed in ASCII apostrophes or quotation marks, not curly quotes. The numeric entities you're converting should be &#39; or &#x27; (apostrophe) or &#34; or &#x22; (quotation mark). Instead, you appear to have &#x2019;, which represents the same character as &rsquo;, &#8217;, or ’.

As for the second problem, the resulting text seems to be encoded as UTF-8, but at some point it's being read as if it were windows-1252. In UTF-8, the character ’ is represented by the three-byte sequence E2 80 99, but windows-1252 converts each byte separately, to â, €, and ™. Wherever that's happening, it's not in the code you showed us.
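The misreading described above can be reproduced in a few lines. This is only a minimal sketch, assuming the mbstring extension is available; the variable names are made up for illustration.

```php
<?php
// U+2019 (RIGHT SINGLE QUOTATION MARK) as raw UTF-8 bytes.
$quote = "\xE2\x80\x99";

// Treat those three bytes as windows-1252 and re-encode the result as UTF-8.
// This mimics a component that reads UTF-8 data with the wrong charset.
$mojibake = mb_convert_encoding($quote, 'UTF-8', 'Windows-1252');

echo bin2hex($quote), "\n";  // e28099
echo $mojibake, "\n";        // three characters instead of one: â, €, ™
```

If the garbled output on your page matches `$mojibake`, the data itself is probably fine and the wrong charset is being applied somewhere between the database and the browser.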

The good news is that your preg_replace code seems to be working correctly. ;) But I think the others are right when they say you can use html_entity_decode() alone for that part.
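One caveat for anyone on a newer PHP: the /e modifier used in the question was deprecated in PHP 5.5 and removed in PHP 7. If the manual replacement were kept, it might look roughly like the sketch below. It assumes the mbstring extension (mb_chr() needs PHP 7.2+) and UTF-8 output; decode_numeric_refs is a hypothetical helper name, not something from the question.

```php
<?php
// Hypothetical helper: decodes &#xHH; and &#DD; references to UTF-8 text.
function decode_numeric_refs(string $text): string
{
    $result = preg_replace_callback(
        // Digit counts are capped so the value always fits in an int.
        '~&#(x[0-9a-f]{1,6}|[0-9]{1,7});~i',
        function (array $m): string {
            // A leading "x" or "X" marks a hexadecimal reference.
            $isHex = strtolower($m[1][0]) === 'x';
            $codePoint = $isHex ? (int) hexdec(substr($m[1], 1)) : (int) $m[1];
            $char = mb_chr($codePoint, 'UTF-8');
            // Leave the reference untouched if the code point is invalid.
            return $char === false ? $m[0] : $char;
        },
        $text
    );
    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

echo decode_numeric_refs('src=&#x2019;img/the_image.png&#8217;'), "\n";
```

Unlike chr(), mb_chr() returns the full multi-byte UTF-8 sequence for code points above 255.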

It could be that your data is in a different character encoding than your page, ISO-8859-1 vs. UTF-8, for example.

chr() only returns a single byte (values 0 to 255), so your non-ASCII characters are getting messed up. Unless I'm misunderstanding what you're trying to do, you just need a single call to html_entity_decode() with the correct charset parameter, and can get rid of the other two lines.
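To illustrate both points: chr() returns exactly one byte and, per the PHP manual, wraps values outside 0..255, so it cannot produce a multi-byte UTF-8 character such as ’. A single html_entity_decode() call with an explicit charset handles hex, decimal and named references alike. The sample string below is made up for the demonstration, and ENT_QUOTES | ENT_HTML5 is one reasonable flag choice rather than a requirement.

```php
<?php
// chr() yields one byte; 0x2019 is reduced to 0x19, not to the curly quote.
var_dump(chr(0x2019) === "\x19"); // bool(true)

// Made-up sample mixing a hex, a decimal and a named reference.
$code = 'src=&#x2019;img/the_image.png&#8217; &rsquo;';

// One call decodes all three forms to UTF-8 bytes.
$decoded = html_entity_decode($code, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($decoded === "src=\u{2019}img/the_image.png\u{2019} \u{2019}"); // bool(true)
```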

Although the name doesn’t reflect it, html_entity_decode does also convert numeric character references.

// α (U+03B1) == 0xCEB1 (UTF-8)
var_dump("\xCE\xB1" == html_entity_decode('&#x03B1;', ENT_COMPAT, 'UTF-8'));
