How to find the frequency of a phrase (multi-token string) inside a document in Java?
I want to find the frequency of a multi-token string, or phrase, inside a document. It's not the word/single-term frequency that I am looking for; it will always be multiple terms, and the number of terms is dynamic.
Example: finding the frequency of "words with friends" inside a document.
Any help/pointer will be much appreciated.
Thanks, Debjani
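You can read the document line by line using a BufferedReader, and then split each line on the phrase: the number of pieces minus one is the number of occurrences on that line. Pass -1 as the limit so that trailing empty strings are kept; without it, a phrase at the very end of a line is not counted. Note that split treats its argument as a regex, and that a phrase broken across two lines is missed by this approach.
// br is an already opened BufferedReader
String strLine;
int count = 0;
while ((strLine = br.readLine()) != null) {
    // limit -1 keeps trailing empty strings, so a match at the end of the line still counts
    count += strLine.split("words with friends", -1).length - 1;
}
return count;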
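EDIT: If you want to perform a case-insensitive search, you can compile the phrase into a Pattern and split with that instead:
Pattern myPattern = Pattern.compile("words with friends", Pattern.CASE_INSENSITIVE);
// br is an already opened BufferedReader
String strLine;
int count = 0;
while ((strLine = br.readLine()) != null) {
    // limit -1 keeps trailing empty strings, so a match at the end of the line still counts
    count += myPattern.split(strLine, -1).length - 1;
}
return count;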
Why not use a regex? A Matcher can count the occurrences directly: call find() repeatedly on the text and increment a counter for each match.
http://download.oracle.com/javase/1.5.0/docs/api/java/util/regex/Matcher.html
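A minimal sketch of the Matcher approach, assuming the whole document fits in memory as a single String. The class and method names (PhraseFrequency, phrasePattern, countPhrase) are made up for illustration and are not from the original posts. Because the question says the number of terms is dynamic, the pattern is built from a variable list of tokens; each token is quoted so it is matched literally, and the tokens are joined with \s+ so a phrase that wraps across a line break is still found.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhraseFrequency {

    // Builds a case-insensitive pattern in which the tokens must appear
    // in order, separated by one or more whitespace characters.
    static Pattern phrasePattern(String... tokens) {
        StringBuilder regex = new StringBuilder();
        for (String token : tokens) {
            if (regex.length() > 0) {
                regex.append("\\s+");
            }
            // Pattern.quote makes regex metacharacters in the token literal.
            regex.append(Pattern.quote(token));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Counts non-overlapping occurrences of the phrase in the text.
    static int countPhrase(String text, String... tokens) {
        Matcher matcher = phrasePattern(tokens).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "I play Words with Friends. Words with\nfriends is fun.";
        System.out.println(countPhrase(text, "words", "with", "friends")); // prints 2
    }
}
```

This sketch does not enforce word boundaries, so "swords with friends" would also count as a match; wrapping the generated regex in \b on both sides is one way to tighten that if the tokens start and end with word characters.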