How can I find out what form a punctuation character takes in UTF-8?

I have a set of characters like

., !, ?, ;, (space)

and a string, which may or may not be UTF-8 (any language).

Is there an easy way to find out whether the string contains one of the characters above?

For example:

这是一个在中国的字符串。

which translates to

This is a string in Chinese.

The dot character looks different in the first string. Is that a totally different character, or the UTF-8 counterpart of the dot?

Or maybe there's a list somewhere with Unicode punctuation character codes?


In Unicode there are character properties (see the PHP docs), for example Symbols, Letters and the like. You can search a string for any character of a specific class with preg_match (see the PHP docs) and the u modifier.

echo preg_match('/\pP$/u', $str);

However, your string needs to be UTF-8 to do that.

You can test this on your own. I created a little script that tests for all properties via preg_match:

Looking for properties of last character in "Test.":
Found Punctuation (P).
Found Other punctuation (Po).

Looking for properties of last character in "这是一个在中国的字符串。":
Found Punctuation (P).
Found Other punctuation (Po).
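
The original script is not shown, so the following is only a minimal sketch of how output like the above could be produced. The function name `lastCharProperties` and the `$properties` table are hypothetical, and the table covers only a small subset of the Unicode general categories.

```php
<?php
// Hypothetical subset of Unicode general categories to test (not the full list).
$properties = [
    'L'  => 'Letter',
    'N'  => 'Number',
    'P'  => 'Punctuation',
    'Po' => 'Other punctuation',
    'Ps' => 'Open punctuation',
    'Pe' => 'Close punctuation',
    'S'  => 'Symbol',
    'Z'  => 'Separator',
];

// Returns the names of all listed properties that the last character of $str has.
function lastCharProperties(string $str, array $properties): array
{
    $found = [];
    foreach ($properties as $code => $name) {
        // \p{..}$ ties the property test to the last character; /u enables UTF-8 mode.
        if (preg_match('/\p{' . $code . '}$/u', $str) === 1) {
            $found[] = "$name ($code)";
        }
    }
    return $found;
}

foreach (['Test.', '这是一个在中国的字符串。'] as $str) {
    echo "Looking for properties of last character in \"$str\":\n";
    foreach (lastCharProperties($str, $properties) as $hit) {
        echo "Found $hit.\n";
    }
    echo "\n";
}
```

The file has to be saved as UTF-8 for the Chinese string literal to be matched correctly.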

Related: PHP - Fast way to strip all characters not displayable in browser from utf8 string.


Yes, 。 (U+3002, IDEOGRAPHIC FULL STOP) is a totally different character from . (U+002E, FULL STOP). If you want to find out whether a string contains one of the listed characters, you can use regular expressions:

preg_match('/[.!?;。]/u', $str, $match)

This will return either 0 or 1 and $match will contain the matched character. With this it’s important that your string in $str is properly encoded in UTF-8.
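
As a rough sketch, the check could be wrapped in a small helper. The name `findListedPunctuation` is hypothetical. Compared with the pattern above, this version also includes the space from the question's list and writes the ideographic full stop as `\x{3002}`, so the pattern does not depend on the encoding of the source file.

```php
<?php
// Returns the first listed punctuation character found in $str, or null if
// there is none. preg_match() returns false for input that is not valid
// UTF-8, which this sketch also reports as null.
function findListedPunctuation(string $str): ?string
{
    // Character class: . ! ? ; space and U+3002 (IDEOGRAPHIC FULL STOP).
    $result = preg_match('/[.!?; \x{3002}]/u', $str, $match);
    if ($result === 1) {
        return $match[0];
    }
    return null;
}

var_dump(findListedPunctuation('这是一个在中国的字符串。')); // the ideographic full stop
var_dump(findListedPunctuation('NoPunctuationHere'));          // NULL
```

If malformed input has to be told apart from "no match", the `false` return value of preg_match can be checked separately with preg_last_error().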

If you want to match any Unicode punctuation character, you can use the pattern \p{P} to describe the Unicode character property instead:

/\p{P}/u
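
To see how each punctuation character is actually encoded in UTF-8, one option is to collect every `\p{P}` match and dump its bytes. This is a sketch only; `punctuationBytes` is a hypothetical helper name, and the input is assumed to be valid UTF-8.

```php
<?php
// Maps each distinct punctuation character in $str to its UTF-8 bytes in hex.
function punctuationBytes(string $str): array
{
    $result = [];
    // preg_match_all() returns the number of matches (0 or false means nothing to do).
    if (preg_match_all('/\p{P}/u', $str, $matches)) {
        foreach ($matches[0] as $char) {
            $result[$char] = strtoupper(bin2hex($char));
        }
    }
    return $result;
}

print_r(punctuationBytes('Test. 这是一个在中国的字符串。'));
```

For this input, the ASCII full stop comes out as the single byte 2E, while the ideographic full stop comes out as the three bytes E38082, which shows that they are separate characters with separate encodings.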


You are not trying to transliterate, you are trying to translate!

UTF-8 is not a language; it is an encoding of the Unicode character set, which supports (virtually) all languages known in the world.

What you are trying to do is something like this:

echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE",  "这是一个在中国的字符串。");
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE",  "à è ò ù");

That does not work with your Chinese example.
