How can PHP construct a Unicode string?
Given the decimal or hex code point of a Unicode character that I want to output from a CLI PHP script, how can PHP generate it? The chr() function does not seem to produce the proper output. Here's my test script, using the section sign U+00A7 (A7 in hex, 167 in decimal, which should be encoded as C2 A7 in UTF-8) as a test:
<?php
echo "Section sign: ".chr(167)."\n"; // Using CHR function
echo "Section sign: ".chr(0xA7)."\n";
echo "Section sign: ".pack("c", 0xA7)."\n"; // Using pack function?
echo "Section sign: §\n"; // Copy and paste of the symbol into source code
The output I get (via an SSH session to the server) is:
Section sign: ?
Section sign: ?
Section sign: ?
Section sign: §
So that proves that the terminal font I'm using has the section sign in it and that the SSH connection is passing it along successfully, but chr() isn't constructing it properly from the code number.
If all I've got is the code number and not a copy/paste option, what options do I have?
Assuming you have iconv, here's a simple way that doesn't involve implementing UTF-8 yourself:
function unichr($i) {
return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}
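On newer PHP versions the same result may be available without a helper. A minimal sketch, assuming PHP 7.2 or later with the mbstring extension loaded (the \u{...} escape needs PHP 7.0+, mb_chr() needs PHP 7.2+); the variable names are only illustrative:

```php
<?php
// Assumption: PHP >= 7.2 with mbstring enabled.
$fromEscape = "\u{A7}";              // Unicode code point escape in a double-quoted string
$fromMbChr  = mb_chr(0xA7, 'UTF-8'); // code point number -> UTF-8 encoded string

echo "Section sign: ", $fromEscape, "\n";
echo "Section sign: ", $fromMbChr, "\n";
echo bin2hex($fromMbChr), "\n";      // c2a7
```

The escape only helps when the code point is known while writing the source; for a number held in a variable, mb_chr() or the iconv approach above is the relevant option.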
Apart from the mb_ functions and iconv, PHP has no knowledge of Unicode. Without them, you'll have to UTF-8 encode the character yourself.
For that, Wikipedia has an excellent overview on how UTF-8 is structured. Here's a quick, dirty and untested function based on that article:
function codepointToUtf8($codepoint)
{
    if ($codepoint <= 0x7F) // U+0000-U+007F - 1 byte
        return chr($codepoint);
    if ($codepoint <= 0x7FF) // U+0080-U+07FF - 2 bytes
        return chr(0xC0 | ($codepoint >> 6)).chr(0x80 | ($codepoint & 0x3F));
    if ($codepoint <= 0xFFFF) // U+0800-U+FFFF - 3 bytes
        return chr(0xE0 | ($codepoint >> 12)).chr(0x80 | (($codepoint >> 6) & 0x3F)).chr(0x80 | ($codepoint & 0x3F));
    // U+010000-U+10FFFF - 4 bytes
    return chr(0xF0 | ($codepoint >> 18)).chr(0x80 | (($codepoint >> 12) & 0x3F)).chr(0x80 | (($codepoint >> 6) & 0x3F)).chr(0x80 | ($codepoint & 0x3F));
}
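To make the bit manipulation concrete, here is the two-byte case worked through for the section sign from the question. This is only an illustrative sketch; $lead and $trail are names chosen for the example:

```php
<?php
// Worked example for U+00A7, which falls in the two-byte range U+0080-U+07FF.
$codepoint = 0xA7;                    // binary 1010 0111
$lead  = 0xC0 | ($codepoint >> 6);    // 110xxxxx carries the bits above the low six
$trail = 0x80 | ($codepoint & 0x3F);  // 10xxxxxx carries the low six bits

printf("%02X %02X\n", $lead, $trail); // C2 A7
echo chr($lead) . chr($trail), "\n";  // prints the section sign on a UTF-8 terminal
```

The lead byte comes out as C2 and the trailing byte as A7, matching the C2 A7 sequence the question expects.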
Don't forget that UTF-8 is a variable-length encoding. § is not among the first 128 (ASCII) characters that UTF-8 can encode in one byte. It is a multi-byte character in UTF-8, prefixed with a C2 byte that marks the start of a two-byte sequence. This should work:
echo "Section sign: ".chr(0xC2).chr(0xA7)."\n";
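The question's "?" output follows from this: a lone A7 byte is not a valid UTF-8 sequence by itself, so the terminal substitutes a placeholder. A small sketch that checks both strings, using the PCRE /u modifier as a UTF-8 validity test (one possible check among several):

```php
<?php
$single = chr(0xA7);              // one byte: A7
$utf8   = chr(0xC2) . chr(0xA7);  // two bytes: C2 A7

// With the /u modifier, preg_match() returns false when the subject is not
// valid UTF-8, and 1 here when it is (the empty pattern always matches).
var_dump(preg_match('//u', $single) === 1); // bool(false)
var_dump(preg_match('//u', $utf8) === 1);   // bool(true)
echo strlen($utf8), "\n";                   // 2, because strlen() counts bytes
```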
chr
(PHP 4, PHP 5)
chr — Return a specific character
Description
string chr ( int $ascii )
Returns a one-character string containing the character specified by ascii.
This function complements ord().
The important word there is "ascii" :) Try this one:
function uchr ($codes) {
if (is_scalar($codes)) $codes= func_get_args();
$str= '';
foreach ($codes as $code) $str.= html_entity_decode('&#'.$code.';',ENT_NOQUOTES,'UTF-8');
return $str;
}
echo "Section sign: ".uchr(167)."\n"; // Using the uchr() helper
echo "Section sign: ".uchr(0xA7)."\n";
I know I am reopening an old, solved question, but since I stumbled onto this topic while searching for help, I thought I would share the solution I ended up with. The original asker might be interested in refactoring their code accordingly.
Manually reimplementing ASCII-to-Unicode conversion is reinventing the wheel, not to mention the potential for errors and poor performance.
The best solution I found was to use:
- pack to create values from input data, using the appropriate format codes to consume the right amount of data, usually pack("H*", <input data>) to read from hexadecimal values
- mb_convert_encoding to convert ASCII strings to Unicode ones, using mb_convert_encoding(<ASCII string>, "UTF-8"). If the input string is not recognized properly, the third parameter of this function lets you specify the input encoding
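As a rough sketch of how those two functions might be combined for the section sign, assuming the mbstring extension is loaded and that the input byte is ISO-8859-1 (an assumption made for this example, since the answer does not name a source encoding):

```php
<?php
// Assumption: mbstring is available.
$bytes = pack("H*", "a7");  // hex string -> the single raw byte 0xA7

// Naming the source encoding explicitly avoids relying on detection;
// ISO-8859-1 maps byte A7 to U+00A7.
$utf8 = mb_convert_encoding($bytes, "UTF-8", "ISO-8859-1");
echo bin2hex($utf8), "\n";  // c2a7

// A code point number can go through the same call as a 4-byte big-endian value.
$fromCodepoint = mb_convert_encoding(pack("N", 0xA7), "UTF-8", "UCS-4BE");
echo bin2hex($fromCodepoint), "\n"; // c2a7
```

Passing the third argument matters here: without it, mb_convert_encoding() falls back to the configured internal encoding, which may not match the input bytes.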