Convert HTML numbered entities in php to unicode for use on iPhone

2023-01-28 09:21 问答作者：

I'm creating a web service to transfer json to an iPhone app. I'm using json-framework to receive the json, and that works great because it automatically decodes things like "\u2018". The problem I'm running into is there doesn't seem to be a comprehensive way to get all the characters in one fell swoop.

For example html_entity_decode() gets most things, but it leaves behind stuff like ‘ (‘). In order to catch these entities and convert them to something json-framework can use (e.g., \u2018), I'm using this code to convert the &# to \u, convert the numbers to hex, and then strip the ending semicolon.

function func($matches) {
  return "\u" . dechex($matc开发者_如何学Pythonhes[1]);
}
$json = preg_replace_callback("/&#(\d{4});/", "func", $json);

This is working for me at the moment, but it just doesn't feel right. It seems like I'm surely missing some characters that are going to come back to haunt me later.

Does anyone see flaws in this approach? Can anyone think of characters this approach will miss?

Any help would be most appreciated!

From where are you getting this HTML-encoded input? If you're scraping a web page you should be using an HTML parser, which will decode both entity and character references for you. If you are getting them in form input data, you've got a problem with encodings (make sure to serve the page containing the form as UTF-8 to avoid this).

If you must convert an HTML-encoded stretch of literal text to JSON, you should do it by HTML-decoding first then JSON-encoding, rather than attempting to go straight to JSON format (which will fail for a bunch of other characters that need escaping). Use the built-in decoder and encoder functions rather than trying to create JSON-encoded characters like \u.... yourself (as there are traps there).

$html= 'abc &quot; def &#1234; ghi &#x1234; jkl \n mno';
$raw= html_entity_decode($html, ENT_COMPAT, 'utf-8');
$json= json_encode($raw);

"abc \" def \u04d2 ghi \u1234 jkl \\n mno"

‘ is a decimal numbered entity, while I believe \u2018 is a hexadecimal representation. HTML also supports hexadecimal numbered entities (e.g., ‘), but once you've found # as the entity prefix you're looking at either decimal or hex. There are also named entities (e.g., &) but it doesn't sound like you need to cover those cases in your code.

$html_escape = "&quot;Love sex magic rise&quot; &amp; &#23609;&#30495;&#24076; &#8216;";
$utf8 = mb_convert_encoding($html_escape, 'UTF-8', 'HTML-ENTITIES');
echo json_encode(array(
    "title" => $utf8
));

// {"title":"\"Love sex magic rise\" & \u5c39\u771f\u5e0c \u2018"}

This work well for me

继续阅读：json objective-c php unicode

Convert HTML numbered entities in php to unicode for use on iPhone

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？