Turning HTML character entities to 'regular' letters... why is it only partially working?
I'm using all of the below to take a field called 'code' from my database, get rid of all the HTML entities, and print it 'as usual' to the site:
<?php $code = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $code);
$code = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $code);
$code = html_entity_decode($code); ?>
However the exported code still looks like this:
progid:DXImageTransform.Microsoft.AlphaImageLoader(src=’img/the_image.png’);
See what's going on there? How many other things can I run on the string to turn them into darn regular characters?!
Thanks!
Jack
’ is what you get when you read the UTF-8 encoded character ’ (RIGHT SINGLE QUOTATION MARK, U+2019) as if it were encoded as windows-1252. In other words, you have two problems: you're using the wrong encoding to read the wrong character.
HTML attribute values are supposed to be enclosed in ASCII apostrophes or quotation marks, not curly quotes. The numeric entities you're converting should be &#39; or &#x27; (apostrophe) or &#34; or &#x22; (quotation mark). Instead, you appear to have &#8217;, which represents the same character as &#x2019;, &rsquo;, or a literal ’.
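To make the equivalence concrete, here is a minimal sketch that decodes each spelling and compares the resulting bytes. The helper name decode_all() is made up for this illustration; html_entity_decode(), ENT_QUOTES and bin2hex() are standard PHP.

```php
<?php
// Different spellings of a character reference decode to the same character.
$apostrophe = ['&#39;', '&#x27;'];                // U+0027 APOSTROPHE
$quotation  = ['&#34;', '&#x22;', '&quot;'];      // U+0022 QUOTATION MARK
$curly      = ['&#8217;', '&#x2019;', '&rsquo;']; // U+2019 RIGHT SINGLE QUOTATION MARK

// Hypothetical helper: decode every reference in the list to UTF-8.
function decode_all(array $refs): array
{
    return array_map(function ($ref) {
        return html_entity_decode($ref, ENT_QUOTES, 'UTF-8');
    }, $refs);
}

// Print the decoded bytes in hex, one line per reference.
foreach ([$apostrophe, $quotation, $curly] as $group) {
    foreach (decode_all($group) as $i => $char) {
        echo $group[$i], ' => ', bin2hex($char), "\n";
    }
}
```

The two ASCII characters come out as one byte each (27 and 22), while every spelling of the curly quote comes out as the three bytes e28099.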
As for the second problem, the resulting text seems to be encoded as UTF-8, but at some point it's being read as if it were windows-1252. In UTF-8, the character ’ is represented by the three-byte sequence E2 80 99, but windows-1252 converts each byte separately, to â, €, and ™. Wherever that's happening, it's not in the code you showed us.
The good news is that your preg_replace code seems to be working correctly. ;) But I think the others are right when they say you can use html_entity_decode() alone for that part.
It could be that the character encoding of your string is different from the one your page is served in, ISO-8859-1 vs. UTF-8, for example.
chr() only returns a single byte, so it can't produce non-ASCII characters such as U+2019 in a multi-byte encoding like UTF-8, and those characters are getting messed up. Unless I'm misunderstanding what you're trying to do, you just need a single call to html_entity_decode() with the correct charset parameter, and can get rid of the other two lines.
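A minimal sketch of that single call follows. The stored value is made up for the example: the question doesn't show the raw database field, so the &#8217; form is an assumption.

```php
<?php
// Hypothetical stored value; the exact entity form (&#8217;) is an assumption.
$code = 'progid:DXImageTransform.Microsoft.AlphaImageLoader'
      . '(src=&#8217;img/the_image.png&#8217;);';

// chr() returns exactly one byte, so on its own it cannot yield
// the three-byte UTF-8 sequence for U+2019.
echo strlen(chr(8217)), "\n"; // 1

// One call decodes named, decimal and hexadecimal references
// and emits the result as UTF-8.
$decoded = html_entity_decode($code, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n";
```

For the curly quotes to display correctly, the page also has to be declared as UTF-8, for example with a Content-Type header of text/html; charset=UTF-8, so the browser doesn't fall back to windows-1252.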
Although the name doesn’t reflect it, html_entity_decode does also convert numeric character references.
// α (U+03B1) == 0xCEB1 (UTF-8)
var_dump("\xCE\xB1" == html_entity_decode('&#x3B1;', ENT_COMPAT, 'UTF-8'));