Encoding troubles - one format to another
I have a scraper that collects data from a source I have no control over. The source data contains all sorts of interesting Unicode characters, but it converts them to a pretty unhelpful format, so I get
\u00e4
for a small 'a' with umlaut (sans the double quotes that I think are supposed to be there)*. Of course, this gets rendered in my HTML as plain text.
Is there any realistic way to convert the Unicode source into proper characters that doesn't involve manually crunching out every single escape sequence and replacing it during the scrape?
*Here is a sample of the JSON that it spits out:
({"content":{"pagelet_tab_content":"<div class=\"post_user\">Latest post by <span>D\u00e4vid<\/span><\/div>\n})
Since \u00e4 is the JavaScript (and JSON) escape sequence for a Unicode character, one possibility is to use PHP's json_decode() function to decode it into a PHP string.
The valid JSON string would be:
$json = '"\u00e4"';
And this:
header('Content-type: text/html; charset=UTF-8');
$php = json_decode($json);
var_dump($php);
would give you the right output:
string 'ä' (length=2)
(It's one character, but two bytes long in UTF-8.)
Still, it feels a bit hackish ^^
And it might not work too well, depending on the kind of string you get as input...
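If the input turns out not to be valid JSON as a whole (for example, a bare text fragment that merely contains \uXXXX escapes), one option is to decode only the escape sequences. This is a minimal sketch, assuming PHP 7 or later; decode_unicode_escapes is a hypothetical helper name, not something from the scraped source. It matches runs of consecutive escapes so that surrogate pairs are decoded together, and it leaves anything it cannot decode untouched.

```php
<?php
// Hypothetical helper: replaces \uXXXX escapes in an arbitrary string
// with the UTF-8 characters they denote.
function decode_unicode_escapes(string $text): string
{
    return preg_replace_callback(
        // One or more consecutive \uXXXX escapes (keeps surrogate pairs together).
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        function (array $m): string {
            // Wrap the run in double quotes so it forms a valid JSON string.
            $decoded = json_decode('"' . $m[0] . '"');
            // Fall back to the original text if the run is not decodable.
            return is_string($decoded) ? $decoded : $m[0];
        },
        $text
    );
}

echo decode_unicode_escapes('Latest post by D\u00e4vid'), "\n";
```

One caveat of this approach: it does not distinguish an escaped backslash followed by "u00e4" from a real escape, so it is only suitable when the input is known not to contain such sequences.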
[Edit] I've just seen your comment where you seem to indicate you get JSON as input? If so, json_decode() might really be the right tool for the job ;-)
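For the payload in the question, that could look roughly like the sketch below. The sample in the question is cut off, so the $raw string here is an assumed, completed version (closing quote and brace added so that it parses), and stripping the surrounding parentheses with a regular expression is just one possible way to get at the JSON object.

```php
<?php
// Assumed, completed version of the sample payload; the nowdoc syntax
// keeps every backslash exactly as the source sends it.
$raw = <<<'JSON'
({"content":{"pagelet_tab_content":"<div class=\"post_user\">Latest post by <span>D\u00e4vid<\/span><\/div>\n"}})
JSON;

// Strip the parentheses wrapped around the JSON object.
$json = preg_replace('/^\s*\((.*)\)\s*$/s', '$1', $raw);

// Decode into nested associative arrays; all \uXXXX escapes become UTF-8.
$data = json_decode($json, true);
if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

$html = $data['content']['pagelet_tab_content'];
echo $html;
```

The decoded fragment is UTF-8, so the page that prints it should declare charset=UTF-8 as in the header() call above.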
The accepted answer won't work if you use the JSON encoding somewhere in the middle of page execution (e.g. as a plugin for some CMS) or cannot set the header information. Of course, the page header should always be set correctly.
You can pass additional parameters to json_encode / json_decode to "force" UTF-8. I built a simple class for this that uses static methods to get my results.
The key is the JSON_UNESCAPED_UNICODE flag, which makes json_encode emit non-ASCII characters as literal UTF-8 instead of \uXXXX escapes. Use it like this:
Data Class
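/*
 * Data Class
 * * * * * * *
 * Encode and decode your string / object / array as UTF-8 JSON.
 */
class Data {
    // Encode
    // @param $a Array element to encode as JSON
    public static function encode($a = []) {
        // JSON_UNESCAPED_UNICODE keeps multibyte characters literal instead of \uXXXX.
        return json_encode($a, JSON_UNESCAPED_UNICODE);
    }

    // Decode
    // @param $a JSON string
    // @param $t Type of return (false = Array, true = Object)
    public static function decode($a = '', $t = false) {
        // json_decode() takes "associative" as its second argument (true = array),
        // so $t is inverted here. JSON_UNESCAPED_UNICODE only affects json_encode()
        // and is not passed; decoded strings are always UTF-8.
        return json_decode($a, !$t);
    }
}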
Usage
// Get your JSON String
$some_json_string = file_get_contents(YOUR_URL);
// Decode as you wish
$json_as_array = Data::decode($some_json_string);
$json_as_object = Data::decode($some_json_string, true);
// Debug / use your Content
echo "<pre>";
print_r($json_as_array);
print_r($json_as_object);
echo "</pre>";
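As a quick check of what JSON_UNESCAPED_UNICODE actually changes, the following sketch encodes the same string with and without the flag. The variable names are illustrative only. The flag affects the output of json_encode(); both encoded forms decode back to the same UTF-8 string.

```php
<?php
// Start from a UTF-8 string containing a non-ASCII character.
$name = json_decode('"D\u00e4vid"');

// Default: non-ASCII characters are written as \uXXXX escapes.
$escaped = json_encode($name);

// With the flag: non-ASCII characters are written as literal UTF-8.
$unescaped = json_encode($name, JSON_UNESCAPED_UNICODE);

echo $escaped, "\n", $unescaped, "\n";

// Either form decodes to the same string.
var_dump(json_decode($escaped) === json_decode($unescaped));
```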