Check for duplicates in array with extra test

Hi, I have a huge array of words. I want to check it for duplicates, and also for plurals and other word endings and beginnings that would make two entries the same word.

So I want to keep the words, but also make a separate list of the words that carry a basic suffix or prefix, or split each word that has a prefix or suffix into two parts.

So if I have the array...

[repaint, painting, paints, painter, house, car, boat]

it will return...

[re paint, paint ing, paint s, paint er, house, car, boat]


The basis of what you want is a stemming algorithm. A widely used one is Porter2 (the Snowball English stemmer), and I have a JS implementation of it that I wrote a few months ago:

https://github.com/cwolves/stem

It doesn't give you exactly what you want. Running it on your exact words, I get:

> token('repaint painting paints painter house car  boat');
[ 'repaint', 'paint', 'paint', 'painter', 'hous', 'car', 'boat' ]

You'll notice that the prefixes are not stripped and it doesn't "save" the suffixes ('ing', 's', etc).
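For the duplicate check itself, one option is to group words by their stem and treat any group with more than one member as variants of the same word. This is only a rough sketch: `groupByStem` is a name I made up for illustration, and the `stems` array is simply the output shown above, typed in by hand rather than produced by a library call.

```javascript
// Hypothetical helper: groups words by a parallel array of stems.
// stems[i] must be the stem of words[i], from whatever stemmer you use.
function groupByStem(words, stems) {
  const groups = new Map();
  words.forEach((word, i) => {
    const key = stems[i];
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(word);
  });
  return groups;
}

const words = ['repaint', 'painting', 'paints', 'painter', 'house', 'car', 'boat'];
// Copied from the stemmer output above.
const stems = ['repaint', 'paint', 'paint', 'painter', 'hous', 'car', 'boat'];

// Keep only the groups where several words share one stem.
const duplicates = [...groupByStem(words, stems).values()]
  .filter((group) => group.length > 1);

console.log(duplicates); // [ [ 'painting', 'paints' ] ]
```

With these stems only 'painting' and 'paints' collapse together; 'repaint' and 'painter' stay separate, which is why the prefix and suffix handling still matters.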

English has only a limited number of prefixes, however, so you can strip them before stemming: 're', 'un', 'under', 'vice', etc. Full list at:

http://en.wikipedia.org/wiki/English_prefixes
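A minimal sketch of that prefix step might look like the following. The four-entry `PREFIXES` list, the `splitPrefix` name, and the `minRest` length guard are all my own assumptions for illustration, not part of any library; you would fill the list in from the Wikipedia page.

```javascript
// Hypothetical, deliberately short prefix list.
// Longer prefixes come first so 'under' is tried before 'un'.
const PREFIXES = ['under', 'vice', 're', 'un'];

// Splits a word into [prefix, rest]; prefix is '' when nothing matches.
// minRest (an assumed heuristic) refuses to strip a prefix when fewer
// than that many characters would remain, e.g. 'red' stays whole.
function splitPrefix(word, minRest = 3) {
  for (const prefix of PREFIXES) {
    if (word.startsWith(prefix) && word.length - prefix.length >= minRest) {
      return [prefix, word.slice(prefix.length)];
    }
  }
  return ['', word];
}

console.log(splitPrefix('repaint')); // [ 're', 'paint' ]
console.log(splitPrefix('house'));   // [ '', 'house' ]
```

Blind stripping like this will mis-split words such as 'really' into 're' + 'ally', so it is probably worth accepting a split only when the remainder (or its stem) also appears elsewhere in your word list.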

The suffixes can, for the most part, be extrapolated by taking the difference between the stemmed word and the final word. e.g. "painting" - "paint" means a suffix of "ing".

Note that this does not always work, because the Porter2 stemming algorithm sometimes adds an extra 'e' to the stemmed word, so the stem is not always a literal prefix of the original word.
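Here is a rough sketch of that "difference" idea, including a fallback for when the stem is not a literal prefix of the word. `splitSuffix` is a hypothetical helper name, and it takes the stem as an argument so it does not depend on any particular stemmer's API.

```javascript
// Hypothetical helper: given a word and its stem (from any stemmer),
// returns the shared root and whatever is left over as the suffix.
function splitSuffix(word, stem) {
  if (word.startsWith(stem)) {
    return { root: stem, suffix: word.slice(stem.length) };
  }
  // The stemmer changed the ending (for instance by appending an 'e'),
  // so fall back to the longest common prefix of word and stem.
  let i = 0;
  while (i < word.length && i < stem.length && word[i] === stem[i]) i++;
  return { root: word.slice(0, i), suffix: word.slice(i) };
}

console.log(splitSuffix('painting', 'paint')); // { root: 'paint', suffix: 'ing' }
console.log(splitSuffix('hoping', 'hope'));    // { root: 'hop', suffix: 'ing' }
```

Be aware that a stem like 'hous' for 'house' would come back as root 'hous' with suffix 'e', so you will likely want to whitelist the suffixes you actually care about ('ing', 's', 'er', ...) and leave the word whole otherwise.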
