How to parse unicode format (e.g. \u201c, \u2014) using PHP
I am pulling data from the Facebook Graph API, which has characters encoded like so: \u201c and \u2014
Is there a function to convert those characters into HTML? i.e. \u2014 -> —
If you have some further reading on these character codes, or suggested reading about unicode in general, I would appreciate it. This is so confusing to me. I don't know what to call these codes... I guess unicode, but unicode seems to mean a whole lot of things.
That's not entirely true, bobince. How do you handle JSON containing Spanish accents? There are two problems. I call FB.api(url, function(response) ... var s=JSON.stringify(response); and pass it to a PHP script via $.post.
First, I get a truncated string, so I need escape(JSON.stringify(response)). Then I get the full JSON-encoded string with Spanish accents. As a test, I saved it in a text file, loaded it with file_get_contents, applied PHP's json_decode, and got nothing. You first need utf8_encode; then you get the object you were after.
After a full day of testing and googling without finding out how to decode unicode properly, I found your post. So many thanks to you.
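The empty json_decode result described in the comment above is what typically happens when the input bytes are not valid UTF-8. Below is a minimal sketch of that situation, assuming the text file was saved as ISO-8859-1 and that the mbstring extension is available; the sample payload is made up for illustration.

```php
<?php
// Hypothetical input: JSON saved as ISO-8859-1, so "ñ" is the single byte 0xF1.
$raw = "{\"name\":\"Espa\xF1a\"}";

// json_decode() expects UTF-8 and returns null for invalid byte sequences.
$firstTry = json_decode($raw, true);

// Convert only when the bytes are not already valid UTF-8.
if (!mb_check_encoding($raw, 'UTF-8')) {
    $raw = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
}

$data = json_decode($raw, true);
echo $data['name'], "\n";
```

If the data is already UTF-8 end to end, no conversion step should be needed, and converting it a second time would corrupt the accents.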
Someone asked me to solve the problem of Arabic text in the Facebook JSON archive. Maybe this code helps someone who is trying to read Arabic text from Facebook (or Instagram) JSON:
$str = '\u00d8\u00ae\u00d9\u0084\u00d8\u00b5';
// Replace each \uXXXX escape with the UTF-8 encoding of that code point.
function decode_encoded_utf8($string) {
    return preg_replace_callback('#\\\\u([0-9a-f]{4})#ism', function ($matches) {
        return mb_convert_encoding(pack("H*", $matches[1]), "UTF-8", "UCS-2BE");
    }, $string);
}
// Each code point here is really one raw UTF-8 byte, so map it back to a single byte.
echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", decode_encoded_utf8($str));
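One possible alternative to the regex above is to let json_decode() resolve the escapes and then undo the byte-per-escape encoding. This is only a sketch: it assumes the archive escaped each UTF-8 byte as its own \u00XX sequence (as in the sample string), that the fragment contains no unescaped quotes, and that mbstring is available. The function name fix_byte_escaped_utf8 is hypothetical.

```php
<?php
// Sample value from the snippet above: each \u00XX escape is one raw UTF-8 byte.
$str = '\u00d8\u00ae\u00d9\u0084\u00d8\u00b5';

// Hypothetical helper; wraps the fragment in quotes so it parses as a JSON string.
function fix_byte_escaped_utf8(string $escaped): string
{
    $decoded = json_decode('"' . $escaped . '"');
    if (!is_string($decoded)) {
        throw new InvalidArgumentException('Not a valid JSON string fragment');
    }
    // Map code points U+0000..U+00FF back to single bytes, which together form UTF-8.
    return mb_convert_encoding($decoded, 'ISO-8859-1', 'UTF-8');
}

echo fix_byte_escaped_utf8($str), "\n";
```

When the whole archive file is available, decoding the complete document with json_decode() and applying the byte mapping to each string value is likely safer than handling fragments.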
The Facebook Graph API returns JSON objects. Use json_decode() to read them into PHP and you do not have to worry about handling string literal escapes like \uNNNN. Don't try to decode JSON/JavaScript string literals by yourself, or extract chosen properties using regex.
Having read the string value, you'll have a UTF-8-encoded string. If your target HTML is also UTF-8-encoded, you don't need to replace — (U+2014) with any entity reference. Just use htmlspecialchars() on the string when outputting it, so that any < or & characters in the string are properly encoded.
If you do for some reason need to produce ASCII-safe HTML, use htmlentities() with the charset arg set to 'utf-8'.
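The approach described in this answer could look roughly like the following. It is a minimal sketch: the JSON payload and its "message" field are invented for illustration, and a real Graph API response would carry different fields.

```php
<?php
// Hypothetical sample payload; single quotes keep the \uNNNN escapes literal in PHP.
$json = '{"message":"\u201cHello\u201d \u2014 a <b>test</b> & more"}';

// json_decode() resolves the \uNNNN escapes and returns UTF-8 strings.
$data = json_decode($json, true);
if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}
$message = $data['message'];

// For a UTF-8 page: escape only the HTML-special characters.
$utf8Html = htmlspecialchars($message, ENT_QUOTES, 'UTF-8');

// For ASCII-safe output: also turn non-ASCII characters into named entities.
$asciiHtml = htmlentities($message, ENT_QUOTES, 'UTF-8');

echo $utf8Html, "\n", $asciiHtml, "\n";
```

With the first form the page itself must be served as UTF-8 (for example via a Content-Type header or a meta charset tag); the second form produces entities such as &mdash; and stays readable regardless of the page encoding.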