Confusion about the Porter stemming algorithm

I am trying to implement the Porter stemming algorithm, but I got stuck at this passage:

where the square brackets denote arbitrary presence of their contents. Using (VC){m} to denote VC repeated m times, this may again be written as

[C](VC){m}[V].

m will be called the \measure\ of any word or word part when represented in this form. The case m = 0 covers the null word. Here are some examples:

m=0    TR,  EE,  TREE,  Y,  BY.
m=1    TROUBLE,  OATS,  TREES,  IVY.
m=2    TROUBLES,  PRIVATE,  OATEN,  ORRERY.

I don't understand what this "measure" is. What does it stand for?


It looks like the measure is the number of times a run of vowels is immediately followed by a run of consonants, that is, the number of (VC) groups in the form [C](VC){m}[V]. For example,

"TROUBLES" has:

Optional initial consonants [C] = "TR".

First vowels-consonants group (VC) = "OUBL".

Second vowels-consonants group (VC) = "ES".

Optional ending vowels [V] is empty.

So the measure is two, the number of times (VC) was "matched".
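
The counting above can be sketched in Python. This is only a minimal sketch, not Porter's reference implementation. The names `is_consonant` and `measure` are my own. The rule for Y (a consonant unless it is preceded by a consonant) comes from Porter's definition of a consonant, which the quoted passage does not include, so treat it as an assumption here.

```python
def is_consonant(word, i):
    """Return True if word[i] acts as a consonant (assumed Porter rule for Y)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a consonant at the start of the word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Count the (VC) groups in the form [C](VC){m}[V]."""
    word = word.lower()
    m = 0
    prev_is_vowel = False
    for i in range(len(word)):
        cons = is_consonant(word, i)
        # Each switch from a vowel run to a consonant run closes one (VC) group.
        if cons and prev_is_vowel:
            m += 1
        prev_is_vowel = not cons
    return m


print(measure("TROUBLES"))  # 2
print(measure("TREE"))      # 0
```

Counting the transitions from a vowel to a consonant gives the same result as matching the (VC) groups one by one, because every group ends exactly where a vowel run hands over to a consonant run.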
