How do tokenization and pattern matching work in Chinese?

Posted: 2023-04-10 16:51

This question involves computing as well as knowledge of Chinese. I have a set of Chinese queries and a separate list of Chinese phrases, and I need to find which of the queries contain any of the phrases.

In English, this is a very simple task. I don't understand Chinese at all (its semantics, grammar rules, etc.), so I would appreciate it if somebody on this forum who also understands Chinese could give me a basic understanding of the language and of how pattern matching is done for Chinese.

My basic impression is that in Chinese one unit (written without any spaces) can actually contain more than one word (is this correct?). If so, are there rules for how several words combine into a single unit? It is confusing because there appear to be spaces in Chinese writing, yet even a unit without spaces has more than one word in it.

Any links that explain Chinese from a computational point of view, pattern matching, etc. would be very useful.

My basic impression is that in Chinese one unit (written without any spaces) can actually contain more than one word (is this correct?).

In Chinese, spaces are rarely used, e.g.:

递归（英语：Recursion），又譯為遞迴，在数学与计算机科学中，是指在函数的定义中使用函数自身的方法。递归一词还较常用于描述以自相似方法重复事物的过程。例如，当两面镜子相互之间近似平行时，镜中嵌套的图像是以无限递归的形式出现的。

Notice that what appear to be spaces are actually Chinese punctuation characters, which are full-width and so have more padding than usual.
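For the task in the question (finding which queries contain any phrase from a list), the absence of spaces means a plain substring test can be applied to the raw text without tokenizing it first. The following is a minimal sketch in Python, assuming the queries and phrases are already decoded Unicode strings. The function names and the sample data are made up for illustration, and the NFKC normalization step is an added assumption meant to fold full-width and half-width variants of the same character together; it does not convert between simplified and traditional characters.

```python
import unicodedata


def normalize(text):
    # NFKC folds compatibility variants (for example full-width Latin
    # letters and full-width commas) into a single canonical form.
    return unicodedata.normalize("NFKC", text)


def find_matching_queries(queries, phrases):
    """Map each query that contains at least one phrase to the phrases found in it."""
    normalized_phrases = [(phrase, normalize(phrase)) for phrase in phrases]
    matches = {}
    for query in queries:
        normalized_query = normalize(query)
        # Plain substring test: no word boundaries are needed or assumed.
        found = [p for p, np_ in normalized_phrases if np_ in normalized_query]
        if found:
            matches[query] = found
    return matches


# Hypothetical sample data, taken from the example sentence above.
queries = [
    "递归是指在函数的定义中使用函数自身的方法",
    "镜中嵌套的图像",
    "数学与计算机科学",
]
phrases = ["函数", "计算机"]

result = find_matching_queries(queries, phrases)
for query, found in result.items():
    print(query, "->", found)
```

A substring hit can straddle a word boundary, so it shows that the characters occur in that order, not necessarily that the phrase occurs as a word.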

If so, are there rules for how several words combine into a single unit? It is confusing because there appear to be spaces in Chinese writing, yet even a unit without spaces has more than one word in it.

Think of it this way: one Chinese character is very, very roughly similar to one English word. Often two or more characters need to be combined to form one word, and each character on its own may mean something completely different depending on context.

To tokenize Chinese text meaningfully, you have to segment it into words with that in mind.
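One simple dictionary-based way to do such segmentation is forward maximum matching: scan left to right and, at each position, take the longest entry from a word list, falling back to a single character. The sketch below is only an illustration under stated assumptions. The vocabulary is a tiny hypothetical word list, `max_len` is an arbitrary cap, and real segmenters generally rely on much larger lexicons or statistical models rather than this greedy rule.

```python
def segment(text, vocabulary, max_len=4):
    """Greedy forward maximum matching over a word list.

    At each position, try the longest candidate first and shrink it until
    it is in the vocabulary; a single character is always accepted.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical mini-dictionary; a real one would hold many thousands of words.
vocabulary = {"递归", "函数", "定义", "使用", "自身", "方法"}

tokens = segment("在函数的定义中使用函数自身的方法", vocabulary)
print(tokens)
# ['在', '函数', '的', '定义', '中', '使用', '函数', '自身', '的', '方法']
```

Because the rule is greedy, it can pick the wrong split when a longer dictionary entry overlaps the intended word, which is one reason context-aware segmenters exist.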

See Chinese Natural Language Processing and Speech Processing, from the Stanford NLP group.

Ken Lunde's book CJKV Information Processing is probably worth a look. The basic word order is subject - verb - object, but see also "Topic prominence" in http://en.wikipedia.org/wiki/Chinese_grammar

