regex breaking Chinese string
When I run this code on Chinese text, the ni (你) character (and maybe others) gets chopped off and broken.
$sample = "你不喜欢 香蕉 吗";
$parts = preg_split("/[\s,]+/", $sample);
var_dump($parts);
//outputs
array(4) {
[0]=>
string(2) "�"
[1]=>
string(9) "不喜欢"
[2]=>
string(6) "香蕉"
[3]=>
string(3) "吗"
}
//in 我觉得 你很 麻烦
//out
array(4) {
[0]=>
string(9) "我觉得"
[1]=>
string(2) "�"
[2]=>
string(3) "很"
[3]=>
string(6) "麻烦"
}
Is my regex wrong?
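If your string is in UTF-8, you must use the u modifier, so that the pattern and the subject are treated as UTF-8 characters rather than as individual bytes: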
$sample = "你不喜欢 香蕉 吗";
$parts = preg_split("/[\\s,]+/u", $sample);
var_dump($parts);
If it's in another encoding, see unicornaddict's answer.
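To show why the character breaks, here is a minimal sketch. It assumes the cause is that, without the u modifier, \s is matched byte by byte and the locale's character tables classify byte 0xA0 as whitespace. That byte happens to be the last of the three UTF-8 bytes of 你, which would explain the string(2) fragment in the question. The variable names are only illustrative.

```php
<?php
// The three UTF-8 bytes of 你 (U+4F60), written out explicitly
// so the example does not depend on the source file's encoding.
$ni = "\xE4\xBD\xA0";
echo bin2hex($ni), "\n"; // e4bda0

// With the u modifier, \s is matched against whole code points,
// so the trailing 0xA0 byte can no longer be split off on its own.
$sample = $ni . "不喜欢 香蕉 吗";
$parts = preg_split('/[\s,]+/u', $sample);
var_dump($parts); // three strings: 你不喜欢, 香蕉, 吗
```

Whether the version without u actually breaks depends on the PCRE version and locale, which is why it can look fine on one machine and fail on another.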
Since the input string is multi-byte, I guess you'll have to use mb_split in place of preg_split.
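A rough sketch of that approach, assuming the mbstring extension is loaded and the input is UTF-8. Note that mb_split takes the pattern without the surrounding / delimiters, and that the encoding is set separately with mb_regex_encoding:

```php
<?php
// Tell the mb_ereg family which encoding the subject string uses.
mb_regex_encoding('UTF-8');

$sample = "你不喜欢 香蕉 吗";

// Same character class as in the question, but without delimiters or modifiers.
$parts = mb_split('[\s,]+', $sample);
var_dump($parts); // three strings: 你不喜欢, 香蕉, 吗
```

For a string in another encoding, the argument to mb_regex_encoding would have to name that encoding instead.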