How to escape Chinese Unicode characters in URL?
I have Chinese users of my PHP web application who enter products into our system. The information they're entering is, for example, a product title and price.
We would like to use the product title to generate a nice URL slug for those products. It seems we cannot just use Chinese characters in HREF attributes.
Does anyone know how we can handle a title like “婴儿服饰” so that we can generate a clean URL like http://www.site.com/婴儿服饰?
Everything works fine for “normal” (Latin-script) languages, but languages whose characters take multiple bytes in UTF-8 give us problems.
Also, when generating the clean URL, we want to keep SEO in mind, but I have no experience with Chinese in that matter.
If your string is already UTF-8, just use rawurlencode to encode the string properly:
$path = '婴儿服饰';
$url = 'http://example.com/'.rawurlencode($path);
UTF-8 is the preferred character encoding for non-ASCII characters (although only ASCII characters are allowed in URIs, which is why you need to use percent-encoding). The result is the same as in tchrist’s example:
http://example.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0
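As a rough sketch of how the encoding round-trips, the snippet below encodes a title and decodes it again with rawurldecode. The title with a space in it is a made-up example, and the script is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical title; the space is included to show how it is encoded.
$title = '婴儿 服饰';

// rawurlencode percent-encodes each UTF-8 byte; a space becomes %20.
$encoded = rawurlencode($title);

// rawurldecode reverses the encoding and returns the original bytes.
$decoded = rawurldecode($encoded);

echo $encoded, "\n"; // %E5%A9%B4%E5%84%BF%20%E6%9C%8D%E9%A5%B0
echo $decoded, "\n"; // 婴儿 服饰
```

Note that urlencode would turn the space into "+" instead of "%20", which is meant for query strings rather than path segments.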
This code, which uses the CPAN module, URI::Escape:
#!/usr/bin/env perl
use v5.10;
use utf8;
use URI::Escape qw(uri_escape_utf8);
my $url = "http://www.site.com/";
my $path = "婴儿服饰";
say $url, uri_escape_utf8($path);
when run, prints:
http://www.site.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0
Is that what you're looking for?
BTW, those four characters are:
CJK UNIFIED IDEOGRAPH-5A74
CJK UNIFIED IDEOGRAPH-513F
CJK UNIFIED IDEOGRAPH-670D
CJK UNIFIED IDEOGRAPH-9970
Which, according to the Unicode::Unihan database, seems to be yīng ér fú shì, or perhaps just ying er fú shi per Lingua::ZH::Romanize::Pinyin. And maybe even jing¹ jan⁴ fuk⁶ sik¹ or jing˥ jan˨˩ fuk˨ sik˥, using the Cantonese version from Unicode::Unihan.
Use the encoded URL as the href attribute of the link, and keep the original characters as the content of the link.
That way you get a safe URL and keep the web page SEO-friendly.
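One way this might look in PHP is sketched below. The helper name productLink and the base URL are assumptions made for illustration, not part of the original application.

```php
<?php
// Hypothetical helper: builds an anchor whose href is percent-encoded
// while the visible link text keeps the original characters.
function productLink(string $baseUrl, string $title): string
{
    // Encode only the title part; the base URL is assumed to be valid already.
    $href = $baseUrl . rawurlencode($title);

    // Escape both values for safe output in HTML.
    return '<a href="' . htmlspecialchars($href, ENT_QUOTES, 'UTF-8') . '">'
        . htmlspecialchars($title, ENT_QUOTES, 'UTF-8') . '</a>';
}

echo productLink('http://example.com/', '婴儿服饰'), "\n";
// <a href="http://example.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0">婴儿服饰</a>
```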
// Safely convert a URL like "http://example.com/婴儿服饰" to a valid encoded string
// => http://example.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0
// KEY: a multibyte character occupies more than one byte in UTF-8.
// Single-byte (ASCII) characters, including spaces, are left untouched.
// Requires the mbstring extension.
function autoEncodeMultibyteChars($url) {
    $encoding = 'UTF-8';
    $mbLen = mb_strlen($url, $encoding);
    $append = '';
    for ($idx = 0; $idx < $mbLen; $idx++) {
        $char = mb_substr($url, $idx, 1, $encoding);
        if (strlen($char) > 1) { // multibyte char
            $append .= rawurlencode($char);
        } else {
            $append .= $char;
        }
    }
    return $append;
}