How can I tell which Unicode characters are letters (parts of words) and which are punctuation marks?

2022-12-19 18:03 Q&A author:

I want to detect words in text, i.e. I need to know which characters in a given text are letters (that is, they can be part of a spoken word) and which are punctuation and the like.

For example, in the above sentence, "I", "want" and "i" and "e" are words in this regard, while spaces, "." and comma are not.

The difficulty in this is that I want to be able to read any kind of script that's based on Unicode. E.g., the German word "schön" is one word. But what about Greek, Arabic or Japanese?

So, what I need is a table or list specifying all ranges of characters that can form words. Optionally, I'd also like to know which characters are digits that can form numbers (assuming other scripts have numbering schemes similar to the Arabic numerals).

I need this for Mac OS X, Windows and Linux. I'll write a C app, so it needs to be either an OS library or a complete code/data solution that I could translate into C.

I know that Mac OS (Cocoa) offers functions for this purpose, but I am not sure if there are similar solutions for Win and Linux (gtk based, probably?).

Alternatively, I could write my own code if I had the complete tables.

I have found the Unicode charts (http://unicode.org/charts/index.html#scripts), but they don't come in one convenient form that I could use in programming.

So, can someone tell me if there are functions for Windows and Linux for this purpose, or where I can find a complete table/list of word characters in Unicode?

You can try to use the Unicode character category to figure out what the word separators may be, but be aware that some languages (e.g. Japanese) do not even have word separators.
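To make the category idea concrete, here is a minimal sketch in Java (the question asks for C, so treat it as an illustration of the logic rather than a drop-in solution). The names WordSplitter, isWordChar and words are made up for this example. It assumes that "word character" means any letter category, plus a combining mark that directly follows a letter. That assumption is not enough for scripts written without separators: a run of Japanese text would come out as a single "word".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of any library): splits text into runs of
// letters, keeping combining marks attached to the letter they follow.
final class WordSplitter {

    // Assumption: a code point is a word character if it is a letter
    // (Lu, Ll, Lt, Lm, Lo), or a combining mark (Mn, Mc, Me) inside a word.
    static boolean isWordChar(int codePoint, boolean insideWord) {
        if (Character.isLetter(codePoint)) {
            return true;
        }
        int type = Character.getType(codePoint);
        boolean isMark = type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
        return insideWord && isMark;
    }

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Step by code point, not by char, so characters outside the BMP work.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isWordChar(cp, current.length() > 0)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Anything else (space, punctuation, digit) ends the word.
                result.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // \u00f6 is the letter o with diaeresis, as in the German "schön".
        System.out.println(words("I want, i.e. sch\u00f6n"));
    }
}
```

For the sample sentence this yields the five words I, want, i, e and schön. Digits are deliberately treated as separators here; whether "2nd" should be one word is a policy decision the sketch does not try to settle.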

If you are familiar with Python at all, the Natural Language Toolkit (NLTK) provides chunkers and lexical tools that will do this across languages. I'd pretend to be smart here and tell you more, but everything I know is out of this book, which I highly recommend. I realize you could code up a technical solution with a regex that would get you 80% of the way to where you want to be, but why reinvent the wheel?

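The C runtime has:

ispunct() tests for a punctuation character
iscntrl() tests for a control character

Both work on single char values and depend on the current locale. Their wide-character counterparts, iswpunct() and iswcntrl(), are declared in <wctype.h>.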

In Java, there is static int java.lang.Character.getType(int codePoint) which can be compared to the constants provided in the same class, like this:

switch (Character.getType(codePoint)) {
    case Character.UPPERCASE_LETTER:
    case Character.LOWERCASE_LETTER:
    case Character.TITLECASE_LETTER:
    case Character.MODIFIER_LETTER:
    case Character.OTHER_LETTER:
        // you found a letter
        break;
    case Character.NON_SPACING_MARK:
        // you found a combining diacritical mark
        // see: https://en.wikipedia.org/wiki/Combining_character
        break;
    default:
        // you found other symbols
        break;
}
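The question also asks about digits in other scripts. A minimal sketch of how the same class can handle that, assuming "digit" means the Unicode category Nd (decimal digit); DigitInfo, decimalValue and parseDecimal are hypothetical names used only for this example:

```java
// Hypothetical helper (not part of any library): maps decimal digits of
// any script to their numeric value using java.lang.Character.
final class DigitInfo {

    // Returns 0..9 for a code point in category Nd, or -1 for anything else.
    static int decimalValue(int codePoint) {
        if (Character.getType(codePoint) != Character.DECIMAL_DIGIT_NUMBER) {
            return -1;
        }
        return Character.digit(codePoint, 10);
    }

    // Parses a run of decimal digits from any script into a number.
    // No overflow checking; this is only a sketch.
    static long parseDecimal(String digits) {
        long value = 0;
        for (int i = 0; i < digits.length(); ) {
            int cp = digits.codePointAt(i);
            int d = decimalValue(cp);
            if (d < 0) {
                throw new IllegalArgumentException("not a decimal digit at index " + i);
            }
            value = value * 10 + d;
            i += Character.charCount(cp);
        }
        return value;
    }

    public static void main(String[] args) {
        // U+0664 and U+0662 are the Arabic-Indic digits four and two.
        System.out.println(parseDecimal("\u0664\u0662")); // prints 42
    }
}
```

Characters such as Roman numerals (category Nl) are not decimal digits, so decimalValue returns -1 for them.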
