Remove stop words in Java
I have a list of around 30 stop words and a set of articles.
I want to parse each article and remove those stop words from it.
I am not sure what the most efficient way to do this is.
For instance, I could loop through the stop list and replace each word with whitespace wherever it occurs in the article, but that does not seem like a good approach.
Thanks
- Put the stop words into a java.util.Set
- Split the input into words
- For each word in the input, check whether the set of stop words contains it, and write it to the output if not
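The steps above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the class name, the method name and the short stop list are made up for the example, and it assumes that words are separated by whitespace and that matching should ignore case.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class StopWordFilter {
    // Hypothetical stop list; the real one would hold the ~30 words from the question.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "is"));

    // Splits on whitespace and keeps every word that is not in the set.
    static String removeStopWords(String article) {
        StringBuilder out = new StringBuilder();
        for (String word : article.split("\\s+")) {
            // Skip the empty token produced by leading whitespace, and skip stop words.
            if (word.isEmpty() || STOP_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(word);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "cat on roof house"
        System.out.println(removeStopWords("The cat is on the roof of a house"));
    }
}
```

Note that a plain whitespace split leaves punctuation attached to words, so "the," would not be recognised as a stop word here.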
Replacing the words will be inefficient, because every replacement builds a new String for the whole article. Your best bet is probably to parse the article word by word and copy each word to a new StringBuffer, unless it is a stop word, in which case you copy whatever you want in its place. StringBuffer (or StringBuilder, its unsynchronized equivalent) is much more efficient than String concatenation here.
How you store the stopwords is probably unimportant if there are only thirty or so. A Set is probably a good bet.
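A minimal sketch of that single pass, assuming a "word" is a run of ASCII letters or apostrophes (a simplification for the example) and using a made-up stop list and class name. It copies everything into one StringBuffer and puts a caller-supplied replacement in place of each stop word, so punctuation and spacing are left alone.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StopWordReplacer {
    // Hypothetical stop list for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "of", "and"));

    // Assumption: a word is a run of ASCII letters or apostrophes.
    static final Pattern WORD = Pattern.compile("[A-Za-z']+");

    static String replaceStopWords(String article, String replacement) {
        Matcher m = WORD.matcher(article);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String word = m.group();
            boolean isStopWord = STOP_WORDS.contains(word.toLowerCase(Locale.ROOT));
            // appendReplacement copies the text since the previous match, then the chosen word.
            m.appendReplacement(out, Matcher.quoteReplacement(isStopWord ? replacement : word));
        }
        m.appendTail(out); // copy whatever follows the last word
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "_ king _ Spain, _ Portugal."
        System.out.println(replaceStopWords("The king of Spain, and Portugal.", "_"));
    }
}
```

Passing an empty string as the replacement removes the stop words but leaves the surrounding spaces in place.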
According to the Sun Java Tutorials, you can use the Perl-compatible \b word-boundary matcher in your regular expressions. If you put \b on both sides of a word, the pattern matches only that whole word, whether it is preceded or followed by punctuation or whitespace.
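As a rough sketch of the \b idea, the stop words can be joined into one alternation and removed in a single replaceAll call. The stop list and names below are invented for the example, and collapsing the leftover whitespace afterwards is my own addition rather than part of the suggestion.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StopWordRegex {
    // Hypothetical stop list for illustration only.
    static final List<String> STOP_WORDS = Arrays.asList("a", "the", "of");

    // Builds \b(?:a|the|of)\b with each word quoted, compiled once and reused.
    static final Pattern STOP_PATTERN = Pattern.compile(
            "\\b(?:"
                    + STOP_WORDS.stream().map(Pattern::quote).collect(Collectors.joining("|"))
                    + ")\\b",
            Pattern.CASE_INSENSITIVE);

    static String removeStopWords(String article) {
        String stripped = STOP_PATTERN.matcher(article).replaceAll("");
        // Removing words leaves doubled spaces behind, so collapse and trim them.
        return stripped.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // prints "end road, start."
        System.out.println(removeStopWords("The end of the road, a start."));
    }
}
```

Because of the word boundaries, "the" inside "theater" is left untouched.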
Read a word from the input, and copy it to your StringBuilder (or wherever you're putting the result) if and only if it's not in the list of stop words. You'll be able to search for them faster if you put the stop words into something like a HashTable.
Edit: oops, don't know what I was thinking, but you want a set, not a HashTable (or any other Dictionary).