Tokenizer, Stop Word Removal, Stemming in Java

2022-12-10 21:05 Q&A author:

I am looking for a class or method that takes a long string of hundreds of words and tokenizes it, removes the stop words, and stems the remaining words for use in an IR system.

For example:

"The big fat cat, said 'your funniest guy i know' to the kangaroo..."

The tokenizer would remove the punctuation and return an ArrayList of words.

The stop word remover would remove words like "the", "to", etc.

The stemmer would reduce each word to its root; for example, 'funniest' would become 'funny'.

Many thanks in advance.

AFAIK Lucene can do what you want. With StandardAnalyzer and StopAnalyzer you can do the stop word removal. In combination with the Lucene contrib-snowball project (which includes work from Snowball) you can do the stemming too.

But for stemming also consider this answer to: Stemming algorithm that produces real words

These are standard requirements in Natural Language Processing so I would look in such toolkits. Since you require Java I'd start with OpenNLP: http://opennlp.sourceforge.net/

If you can look at other languages, there is also NLTK (Python).

Note that "your funniest guy i know" is not standard syntax and this makes it harder to process than "You're the funniest guy I know". Not impossible, but much harder. I don't know of any system that would equate "your" to "you are".

I have dealt with this issue on a number of tasks I have worked on, so let me give a tokenizer suggestion. As I do not see it given directly as an answer: I often use edu.northwestern.at.utils.corpuslinguistics.tokenizer.* as my family of tokenizers, and in a number of cases I used the PennTreebankTokenizer class. Here is how you use it:

    import java.util.List;
    import edu.northwestern.at.utils.corpuslinguistics.tokenizer.*;
    WordTokenizer wordTokenizer = new PennTreebankTokenizer();
    List<String> words = wordTokenizer.extractWords(text);

The link to this work is here. Just a disclaimer, I have no affiliation with Northwestern, the group, or the work they do. I am just someone who uses the code occasionally.

Here is a comprehensive list of NLP tools. Sometimes it makes sense to create these yourself, as they will be lighter and you will have more control over the inner workings: use a simple regular expression for tokenization. For stop words, just load the list below, or some other list, into a HashSet:

common-english-words.txt
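
A minimal sketch of that do-it-yourself approach, using only the JDK. The class name SimplePreprocessor, its method names, and the tiny inline stop-word list are made up for illustration; a real setup would read a full list such as common-english-words.txt into the HashSet instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SimplePreprocessor {
    // Illustrative stop words only (an assumption); load a fuller list from a file in practice.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "i", "of", "the", "to"));

    // Lowercases the text and splits on runs of characters that are neither letters nor digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t); // skip the empty token produced by leading punctuation
            }
        }
        return tokens;
    }

    // Keeps only the tokens that are not in the stop-word set.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "The big fat cat, said 'your funniest guy i know' to the kangaroo...";
        System.out.println(removeStopWords(tokenize(text)));
    }
}
```

Note that splitting on every non-letter character also splits contractions at the apostrophe ("you're" becomes "you" and "re"), so the regular expression would need adjusting if those matter for your index.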

Here is one of many Java implementations of the Porter stemmer.
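
To give a feel for what such a stemmer does, here is a sketch of only the first rule group (step 1a, plural stripping) of the Porter algorithm. It is not a full stemmer, and the class and method names are made up for illustration; use a complete implementation for real work.

```java
class PorterStep1a {
    // Applies Porter step 1a: the longest matching suffix rule wins.
    //   SSES -> SS, IES -> I, SS -> SS, S -> (removed)
    static String stripPlural(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // caresses -> caress
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ponies -> poni
        }
        if (word.endsWith("ss")) {
            return word;                                 // caress -> caress
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // cats -> cat
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"caresses", "ponies", "caress", "cats"}) {
            System.out.println(w + " -> " + stripPlural(w));
        }
    }
}
```

As "ponies" -> "poni" shows, the output of Porter-style stemming is a normalized stem and not necessarily a dictionary word, which is fine for matching terms in an index but not for display.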
