Stop words and stemmer in Java

2023-03-08 01:36

I'm thinking of adding stop-word removal to my similarity program, followed by a stemmer (Porter 1 or 2, depending on which is easiest to implement).

I read my text from files as whole lines and save them as one long string, so suppose I have two strings, e.g.:

String one = "I decided buy something from the shop.";
String two = "Nevertheless I decidedly bought something from a shop.";

Now that I have those strings:

Stemming: Can I just run the stemmer algorithm directly on the string, save the result as a String, and then continue working on the similarity as I did before adding the stemmer, with something like one.stem()?

Stop words: How does this work? O.o Do I just use one.replaceAll("I", ""); or is there a specific way to do this? I want to keep working with a string, so I need a string back before running the similarity algorithms on it. Wiki doesn't say a lot.

Hope you can help me out! Thanks.

Edit: It is for a school-related project where I'm writing a paper on similarity between different algorithms, so I don't think I'm allowed to use Lucene or other libraries that do the work for me. Plus, I would like to understand how it works before I start using libraries like Lucene and co. Hope it's not too much of a bother ^^

If you're not implementing this for academic reasons you should consider using the Lucene library. In either case it might be good for reference. It has classes for tokenization, stop word filtering, stemming and similarity. Here's a quick example using Lucene 3.0 to remove stop words and stem an input string:

public static String removeStopWordsAndStem(String input) throws IOException {
    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("I");
    stopWords.add("the");

    TokenStream tokenStream = new StandardTokenizer(
            Version.LUCENE_30, new StringReader(input));
    tokenStream = new StopFilter(true, tokenStream, stopWords);
    tokenStream = new PorterStemFilter(tokenStream);

    StringBuilder sb = new StringBuilder();
    TermAttribute termAttr = tokenStream.getAttribute(TermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(termAttr.term());
    }
    return sb.toString();
}

Which if used on your strings like this:

public static void main(String[] args) throws IOException {
    String one = "I decided buy something from the shop.";
    String two = "Nevertheless I decidedly bought something from a shop.";
    System.out.println(removeStopWordsAndStem(one));
    System.out.println(removeStopWordsAndStem(two));
}

Yields this output:

decid bui someth from shop
Nevertheless decidedli bought someth from shop

Yes, you can wrap any stemmer so that you can write something like

String stemmedString = stemmer.stemAndRemoveStopwords(inputString, stopWordList);

Internally, your stemAndRemoveStopwords would

place all stop words in a Set (or Map) for fast lookup
initialize an empty StringBuilder to hold the output string
iterate over all words in the input string, and for each word
- look it up in the stop word collection; if found, continue to the top of the loop
- otherwise, stem it using your preferred stemmer, and add it to the output string
return the output string
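
The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: the StemAndStop class, the Stemmer interface and the toy suffix-stripping lambda are made up for this example (a real Porter stemmer would be plugged in instead), and tokenization simply splits on anything that is not a letter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StemAndStop {

    /** Hypothetical stemmer abstraction; a real Porter stemmer would implement this. */
    public interface Stemmer {
        String stem(String word);
    }

    public static String stemAndRemoveStopwords(String input, Set<String> stopWords, Stemmer stemmer) {
        StringBuilder out = new StringBuilder(input.length());
        // Split on runs of non-letters, so "shop." becomes "shop".
        for (String raw : input.split("[^\\p{L}]+")) {
            // Lower-case so that "I" matches the stop word "i".
            String word = raw.toLowerCase(Locale.ROOT);
            if (word.isEmpty() || stopWords.contains(word)) {
                continue; // skip empty tokens and stop words
            }
            if (out.length() > 0) {
                out.append(' '); // keep the output words separated
            }
            out.append(stemmer.stem(word));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stop words are stored lower-cased to match the normalized tokens.
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "i", "the"));
        // Toy stand-in, NOT the Porter algorithm: strips a trailing "ed" only.
        Stemmer toy = w -> w.endsWith("ed") ? w.substring(0, w.length() - 2) : w;
        System.out.println(stemAndRemoveStopwords(
                "I decided buy something from the shop.", stopWords, toy));
    }
}
```

Because the result is an ordinary String, the similarity code that ran before can keep working on it unchanged. Keeping the stemmer behind a small interface also makes it easy to swap Porter 1 for Porter 2 later.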

You don't have to deal with the whole text. Just split it, apply your stopword filter and stemming algorithm, then build the string again using a StringBuilder:

StringBuilder builder = new StringBuilder(text.length());
String[] words = text.split("\\s+");
for (String word : words) {
    if (stopwordFilter.check(word)) { // Keep only words that pass the stopword filter.
        word = stemmer.stem(word); // Apply stemming algorithm.
        if (builder.length() > 0) {
            builder.append(' '); // Separate words so they don't run together.
        }
        builder.append(word);
    }
}
text = builder.toString();
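
The snippet above leaves stopwordFilter undefined. One possible reading is that check returns true when a word should be kept. A minimal sketch of such a filter is below; the StopwordFilter class and its normalization rules (lower-casing, trimming punctuation from both ends) are assumptions for illustration, not part of any library. The last line of main also shows why one.replaceAll("I", "") is risky: it removes the letter from inside other words too.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StopwordFilter {
    private final Set<String> stopWords = new HashSet<>();

    public StopwordFilter(String... words) {
        for (String w : words) {
            stopWords.add(normalize(w)); // store in normalized form
        }
    }

    /** Lower-cases the word and trims non-letter characters from both ends. */
    public static String normalize(String word) {
        return word.toLowerCase(Locale.ROOT)
                   .replaceAll("^[^\\p{L}]+|[^\\p{L}]+$", "");
    }

    /** Returns true if the word should be kept, i.e. it is not a stop word. */
    public boolean check(String word) {
        String w = normalize(word);
        return !w.isEmpty() && !stopWords.contains(w);
    }

    public static void main(String[] args) {
        StopwordFilter filter = new StopwordFilter("a", "I", "the");
        System.out.println(filter.check("I"));      // false
        System.out.println(filter.check("shop."));  // true
        // Naive substring replacement also damages other words:
        System.out.println("I like IDEs".replaceAll("I", "")); // " like DEs"
    }
}
```

Working on whole tokens rather than substrings is what keeps words such as "IDEs" intact.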
