How can I find out what form a punctuation character takes in UTF-8?
I have a set of characters like ".", "!", "?", ";" and " " (space), and a string, which may or may not be UTF-8 (any language).
Is there an easy way to find out if the string contains one of the characters in the set above?
For example:
这是一个在中国的字符串。
which translates to
This is a string in Chinese.
The dot character looks different in the first string. Is that a totally different character, or the UTF-8 counterpart of the dot?
Or maybe there's a list somewhere with Unicode punctuation character codes?
In Unicode there are character properties, for example Symbols, Letters and the like. You can search a string for any character of a specific class with preg_match and the u modifier.
echo preg_match('/\pP$/u', $str);
However, your string needs to be UTF-8 encoded for that to work.
You can test this on your own. I created a little script that tests for all properties via preg_match:
Looking for properties of last character in "Test.":
Found Punctuation (P).
Found Other punctuation (Po).
Looking for properties of last character in "这是一个在中国的字符串。":
Found Punctuation (P).
Found Other punctuation (Po).
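The script itself is not shown above, so the following is only a minimal sketch of how such a test might look. The function name lastCharProperties and the short property list are assumptions made for illustration; PCRE supports many more property codes than the handful checked here.

```php
<?php
// Hypothetical helper (not the original script): report which Unicode
// properties match the last character of $str.
function lastCharProperties(string $str): array
{
    // A small, hand-picked subset of the property codes PCRE understands.
    $properties = [
        'L'  => 'Letter',
        'N'  => 'Number',
        'P'  => 'Punctuation',
        'Po' => 'Other punctuation',
        'S'  => 'Symbol',
        'Z'  => 'Separator',
    ];

    $found = [];
    foreach ($properties as $code => $name) {
        // \p{..}$ ties the property test to the final character;
        // the u modifier makes PCRE treat pattern and subject as UTF-8.
        if (preg_match('/\p{' . $code . '}$/u', $str) === 1) {
            $found[$code] = $name;
        }
    }
    return $found;
}

foreach (['Test.', '这是一个在中国的字符串。'] as $sample) {
    echo 'Looking for properties of last character in "' . $sample . '":' . PHP_EOL;
    foreach (lastCharProperties($sample) as $code => $name) {
        echo "Found $name ($code)." . PHP_EOL;
    }
}
```

Both sample strings end in a character of class Po, which is why the ASCII full stop and the ideographic full stop are reported the same way.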
Related: PHP - Fast way to strip all characters not displayable in browser from utf8 string.
Yes, 。 (U+3002, IDEOGRAPHIC FULL STOP) is a totally different character from . (U+002E, FULL STOP). If you want to find out whether a string contains one of the listed characters, you can use regular expressions:
preg_match('/[.!?;。]/u', $str, $match)
This will return either 0 or 1 (or false on error, for example when $str is not valid UTF-8), and $match will contain the matched character. For this to work, it's important that the string in $str is properly encoded in UTF-8.
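Since the string in the question may or may not be UTF-8, it can help to treat the false return value separately from 0. A minimal sketch, assuming the same character class as above; the function name firstListedPunctuation is made up for this example:

```php
<?php
// Hypothetical helper: return the first listed punctuation character found
// in $str, or null if there is none or the input is not valid UTF-8.
function firstListedPunctuation(string $str): ?string
{
    // With the u modifier, preg_match() returns false (not 0) when the
    // subject is not valid UTF-8.
    $result = preg_match('/[.!?;。]/u', $str, $match);
    if ($result === false) {
        return null; // invalid UTF-8 or another PCRE error
    }
    return $result === 1 ? $match[0] : null;
}

var_dump(firstListedPunctuation('这是一个在中国的字符串。'));
var_dump(firstListedPunctuation('no punctuation here'));
```

To cover the space from the question's list as well, a space could be added inside the character class.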
If you want to match any Unicode punctuation character, you can use the pattern \p{P} to describe the Unicode character property instead:
/\p{P}/u
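As a rough illustration of that pattern, the sketch below collects every punctuation character in a string with preg_match_all. The function name allPunctuation is an assumption for this example, and invalid UTF-8 input is simply treated as containing no matches.

```php
<?php
// Hypothetical helper: collect every Unicode punctuation character in $str.
function allPunctuation(string $str): array
{
    // preg_match_all() returns false on error (e.g. invalid UTF-8 with /u).
    if (preg_match_all('/\p{P}/u', $str, $matches) === false) {
        return [];
    }
    return $matches[0]; // full-pattern matches, in order of appearance
}

print_r(allPunctuation('Hello, world! 你好。'));
```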
You are not trying to transliterate, you are trying to translate!
UTF-8 is not a language; it is a Unicode character encoding that supports (virtually) all languages known in the world.
What you are trying to do is something like this:
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", "这是一个在中国的字符串。");
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", "à è ò ù");
That does not work with your Chinese example, though.