Check for duplicates in array with extra test
Hi, I have a huge array of words and I want to check it for duplicates, including plurals and other word endings and beginnings that would make two entries the same word.
So I want to keep the words, but also make a separate list of the words that have a basic suffix or prefix on them, or split each word that has a prefix or suffix into two parts.
So if I have the array...
[repaint, painting, paints, painter, house, car, boat]
it will return...
[re paint, paint ing, paint s, paint er, house, car, boat]
The basis of what you want is a stemming algorithm. The most common one is called Porter2 and I have a JS implementation of it that I wrote a few months ago:
https://github.com/cwolves/stem
It doesn't give you exactly what you want. Running your exact words, I get:
> token('repaint painting paints painter house car boat');
[ 'repaint', 'paint', 'paint', 'painter', 'hous', 'car', 'boat' ]
You'll notice that the prefixes are not stripped and it doesn't "save" the suffixes ('ing', 's', etc).
English has only a limited number of prefixes, however, so you can strip them beforehand: 're', 'un', 'under', 'vice', etc. Full list at:
http://en.wikipedia.org/wiki/English_prefixes
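As a rough sketch of that prefix-stripping step (not part of the stem library; the `PREFIXES` list, the `splitPrefix` helper and the `minRest` threshold are all made up for illustration), something like this could work:

```javascript
// Illustrative subset of English prefixes; extend it from the Wikipedia list.
// Sorted longest first so that "under" is tried before "un".
const PREFIXES = ['under', 'vice', 're', 'un'].sort((a, b) => b.length - a.length);

// Hypothetical helper: returns [prefix, rest], with prefix '' when nothing matches.
// minRest is an assumed guard so that short words like "red" are left alone.
function splitPrefix(word, minRest = 3) {
  for (const prefix of PREFIXES) {
    if (word.startsWith(prefix) && word.length - prefix.length >= minRest) {
      return [prefix, word.slice(prefix.length)];
    }
  }
  return ['', word];
}

console.log(splitPrefix('repaint')); // [ 're', 'paint' ]
console.log(splitPrefix('boat'));    // [ '', 'boat' ]
```

A purely string-based check like this will produce false positives (it splits "ready" into "re" + "ady"), so you would probably want to confirm that the remainder is itself a word in your array before accepting the split.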
The suffixes can, for the most part, be extrapolated by taking the difference between the stemmed word and the final word. e.g. "painting" - "paint" means a suffix of "ing".
Note that this is not always the case, as the Porter2 stemming algorithm sometimes adds an extra 'e' to stemmed words.
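The "difference" idea can be sketched roughly as follows. `splitSuffix` is a hypothetical helper, not something the stem library provides, and it takes the already-computed stem as an argument so it works with any stemmer. Treating a leftover "e" as "no suffix" and giving up when the stem is not a plain prefix of the word are assumptions on my part:

```javascript
// Hypothetical helper: `stem` is whatever your stemmer returned for `word`.
// Returns [base, suffix], with suffix '' when no clean suffix can be derived.
function splitSuffix(word, stem) {
  // The stemmer may rewrite letters, so the stem is not always a prefix
  // of the word. In that case, keep the word whole.
  if (!word.startsWith(stem)) return [word, ''];
  const suffix = word.slice(stem.length);
  // Assumption: a lone trailing "e" (house -> hous) is a stemming artifact,
  // not a real suffix.
  if (suffix === 'e') return [word, ''];
  return [stem, suffix];
}

// Word/stem pairs copied from the output shown above.
const words = ['painting', 'paints', 'painter', 'house'];
const stems = ['paint', 'paint', 'painter', 'hous'];
console.log(words.map((w, i) => splitSuffix(w, stems[i]).join(' ').trim()));
// [ 'paint ing', 'paint s', 'painter', 'house' ]
```

Note that "painter" stays whole here because the stemmer itself leaves it as "painter", so this approach can only recover suffixes that the stemmer actually removes.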