Displaying top 10 most occurring words in a file in descending order
I am trying to make the code neater and more efficient. I am trying to implement zamzela's method (you will find it in one of the answers below), but I am having trouble implementing the comparator.
public class WordCountExample {
public static void main(String[] args) throws IOException {
Set<WordCount> wordcount = new HashSet<WordCount>();
File file = new File("c:\\test\\input1.txt"); //path to the file
String str = FileUtils.readFileToString(file); // converts a file into a string
String[] words = str.split("\\s+"); // split the line on whitespace,
// would return an array of words
for (String s : words) {
wordcount.add(new WordCount(s));
WordCount.incCount();
}
/*here WordCount is the name of comparator class*/
Collections.sort(wordcount,new WordCount()); //getting a error here
for (WordCount w : wordcount) {
System.out.println(w.getValue() + " " + w.getCount());
}
}
}
Don't store just the word count as the value in your map. Store an object containing the word and its number of occurrences.
public class WordWithOccurrences {
private final String word;
private int occurrences;
// ...
}
And your map should thus be a Map<String, WordWithOccurrences>.
Then sort the list of values based on their occurrences property, and iterate through the last 10 values to display their word property (or sort in reverse order and display the first ten values).
You'll have to use a custom comparator to sort your WordWithOccurrences instances.
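A minimal sketch of this approach, assuming words are separated by whitespace and compared case-insensitively. The class name TopWordsByOccurrences, the helper topWords, the accessor names, and the alphabetical tie-break are illustrative assumptions, not part of the question:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Value object holding a word together with its number of occurrences.
class WordWithOccurrences {
    private final String word;
    private int occurrences;

    WordWithOccurrences(String word) {
        this.word = word;
    }

    String getWord() { return word; }
    int getOccurrences() { return occurrences; }
    void increment() { occurrences++; }
}

// Hypothetical helper class, named here only for illustration.
class TopWordsByOccurrences {
    // Returns up to `limit` entries, most frequent first.
    static List<WordWithOccurrences> topWords(String text, int limit) {
        Map<String, WordWithOccurrences> counts = new HashMap<String, WordWithOccurrences>();
        for (String w : text.toLowerCase().split("\\s+")) {
            if (w.isEmpty()) {
                continue; // split can yield an empty first element
            }
            WordWithOccurrences entry = counts.get(w);
            if (entry == null) {
                entry = new WordWithOccurrences(w);
                counts.put(w, entry);
            }
            entry.increment();
        }
        List<WordWithOccurrences> sorted = new ArrayList<WordWithOccurrences>(counts.values());
        // Custom comparator: descending by occurrences; ties broken
        // alphabetically (an assumption, to make the order deterministic).
        sorted.sort(Comparator.comparingInt(WordWithOccurrences::getOccurrences)
                .reversed()
                .thenComparing(WordWithOccurrences::getWord));
        return new ArrayList<WordWithOccurrences>(
                sorted.subList(0, Math.min(limit, sorted.size())));
    }

    public static void main(String[] args) {
        for (WordWithOccurrences w : topWords("the cat and the dog and the bird", 10)) {
            System.out.println(w.getWord() + " " + w.getOccurrences());
        }
    }
}
```

Sorting in reverse order and taking the first entries avoids having to walk the list backwards.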
I think the best approach is to make a class Word
public class Word implements Comparable<Word>{
private String value;
private Integer count;
public Word(String value) {
this.value = value;
count = 1;
}
public String getValue() {
return value;
}
public Integer getCount() {
return count;
}
public void incCount() {
count++;
}
@Override
public boolean equals(Object obj) {
if (obj instanceof Word)
return value.equals(((Word) obj).getValue());
else
return false;
}
@Override
public int hashCode() {
return value.hashCode();
}
@Override
public int compareTo(Word o) {
return count.compareTo(o.getCount());
}
}
You can work with a HashSet because the count is kept in the bean. After you populate everything, copy the elements into a list, sort it with Collections.sort(list, Collections.reverseOrder()) so that the highest counts come first, and take the first 10 elements.
Finally solved it. Here is a working program that reads a file, counts the occurrences of each word, and lists the top 10 most frequently occurring words in descending order.
import java.io.*;
import java.util.*;
public class Occurance {
public static void main(String[] args) throws IOException {
LinkedHashMap<String, Integer> wordcount =
new LinkedHashMap<String, Integer>();
try {
BufferedReader in = new BufferedReader(
new FileReader("c:\\test\\input1.txt"));
String str;
while ((str = in.readLine()) != null) {
str = str.toLowerCase(); // convert to lower case
String[] words = str.split("\\s+"); //split the line on whitespace, would return an array of words
for( String word : words ) {
if( word.length() == 0 ) {
continue;
}
Integer occurences = wordcount.get(word);
if( occurences == null) {
occurences = 1;
} else {
occurences++;
}
wordcount.put(word, occurences);
}
}
}
catch(Exception e){
System.out.println(e);
}
ArrayList<Integer> values = new ArrayList<Integer>();
values.addAll(wordcount.values());
Collections.sort(values, Collections.reverseOrder());
int last_i = -1;
for (Integer i : values.subList(0, Math.min(10, values.size()))) { // top 10 counts, or fewer if the file has fewer words
if (last_i == i) // without duplicates
continue;
last_i = i;
for (String s : wordcount.keySet()) {
if (wordcount.get(s).equals(i)) // which have this value; == would compare Integer references
System.out.println(s+ " " + i);
}
}
}
}
Assuming your program doesn't actually work, here's a hint:
You're comparing on a per-character basis yourself, and without going through that code, I bet it is wrong:
int idx1 = -1;
for (int i = 0; i < str.length(); i++) {
if ((!Character.isLetter(str.charAt(i))) || (i + 1 == str.length())) {
if (i - idx1 > 1) {
if (Character.isLetter(str.charAt(i)))
i++;
String word = str.substring(idx1 + 1, i);
if (wordcount.containsKey(word)) {
wordcount.put(word, wordcount.get(word) + 1);
} else {
wordcount.put(word, 1);
}
}
idx1 = i;
}
}
Try to use Java's built-in functionality:
String[] words = str.split("\\s+"); //split the line on whitespace, would return an array of words
for( String word : words ) {
if( word.length() == 0 ) {
continue; //for empty lines, split would return at least one element which is ""; so account for that
}
Integer occurences = wordcount.get(word);
if( occurences == null) {
occurences = 1;
} else {
occurences++;
}
wordcount.put(word, occurences);
}
I would have a look at java.util.Comparator. You can define your own comparator, which you can pass to Collections.sort(). In your case, you would sort the keys of your wordcount map by their count. Finally, just take the first ten items of the sorted collection.
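As a rough sketch of that idea, assuming wordcount is a Map<String, Integer> as in the question: the class name SortKeysByCount, the helper topKeys, and the alphabetical tie-break are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, named here only for illustration.
class SortKeysByCount {
    // Returns up to `limit` keys of `wordcount`, highest count first.
    static List<String> topKeys(final Map<String, Integer> wordcount, int limit) {
        List<String> keys = new ArrayList<String>(wordcount.keySet());
        Collections.sort(keys, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                // Descending by count: compare b's count against a's.
                int byCount = wordcount.get(b).compareTo(wordcount.get(a));
                // Ties broken alphabetically (an assumption for determinism).
                return byCount != 0 ? byCount : a.compareTo(b);
            }
        });
        return new ArrayList<String>(keys.subList(0, Math.min(limit, keys.size())));
    }

    public static void main(String[] args) {
        Map<String, Integer> wordcount = new HashMap<String, Integer>();
        wordcount.put("the", 3);
        wordcount.put("and", 2);
        wordcount.put("cat", 1);
        for (String key : topKeys(wordcount, 10)) {
            System.out.println(key + " " + wordcount.get(key));
        }
    }
}
```

Sorting the keys costs O(n log n) in the number of distinct words, which is usually fine for a single file.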
If your wordcount map has too many items, though, you might need something more efficient. It is possible to do this in linear time by keeping an ordered array of size 10 into which you insert each key, always dropping the key with the lowest count.
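One way to sketch the bounded approach is shown here. Note that it swaps the hand-maintained ordered array for a java.util.PriorityQueue used as a min-heap capped at k entries, which gives the same "drop the lowest count" behaviour; for a fixed k such as 10, the work per key is bounded by a constant. The class name BoundedTopWords, the helper topK, and the alphabetical tie-break are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper class, named here only for illustration.
class BoundedTopWords {
    // Orders entries best first: higher count, then alphabetical key (assumed tie-break).
    private static final Comparator<Map.Entry<String, Integer>> BEST_FIRST =
            new Comparator<Map.Entry<String, Integer>>() {
                @Override
                public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                    int byCount = b.getValue().compareTo(a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                }
            };

    // Returns the k keys with the highest counts, best first, in a single pass over the map.
    static List<String> topK(Map<String, Integer> wordcount, int k) {
        List<String> keys = new ArrayList<String>();
        if (k <= 0) {
            return keys; // PriorityQueue requires an initial capacity of at least 1
        }
        // Reversed order puts the current worst entry at the head of the heap.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<Map.Entry<String, Integer>>(k, Collections.reverseOrder(BEST_FIRST));
        for (Map.Entry<String, Integer> e : wordcount.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll(); // drop the entry with the lowest count
            }
        }
        List<Map.Entry<String, Integer>> kept = new ArrayList<Map.Entry<String, Integer>>(heap);
        Collections.sort(kept, BEST_FIRST); // at most k entries, so this is cheap
        for (Map.Entry<String, Integer> e : kept) {
            keys.add(e.getKey());
        }
        return keys;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordcount = new HashMap<String, Integer>();
        wordcount.put("the", 3);
        wordcount.put("and", 2);
        wordcount.put("cat", 1);
        for (String key : topK(wordcount, 10)) {
            System.out.println(key + " " + wordcount.get(key));
        }
    }
}
```

Only k entries are ever held in the heap, so memory stays constant no matter how many distinct words the map contains.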