
Displaying the top 10 most frequently occurring words in a file in descending order

I am trying to make the code neater and more efficient. I am trying to implement zamzela's method (you will find it in one of the answers below), but I am having trouble implementing the comparator.

public class WordCountExample {

public static void main(String[] args) throws IOException {

    Set<WordCount> wordcount = new HashSet<WordCount>();

    File file = new File("c:\\test\\input1.txt");    //path to the file

    String str = FileUtils.readFileToString(file);   // converts a file into a string


    String[] words = str.split("\\s+");     // split the line on whitespace,
                                            // would return an array of words

    for (String s : words) {

        wordcount.add(new WordCount(s));

        WordCount.incCount();

    }

         /*here WordCount is the name of comparator class*/

    Collections.sort(wordcount, new WordCount());   // getting an error here


    for (WordCount w : wordcount) {

        System.out.println(w.getValue() + " " + w.getCount());
    }

}

}


Don't store just the word count as the value in your map. Store an object containing the word and its number of occurrences.

public class WordWithOccurrences {
    private final String word;
    private int occurrences;
    // ...
}

And your map should thus be a Map<String, WordWithOccurrences>.

Then sort the list of values based on their occurrences property, and iterate through the last 10 values to display their word property (or sort in reverse order and display the first ten values).

You'll have to use a custom comparator to sort your WordWithOccurrences instances.
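A minimal sketch of this suggestion, assuming Java 7 or later. The class name `TopWords`, the `top` and `increment` methods, the comparator constant, and the sample sentence are illustrative assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopWords {

    // Holder described above: a word plus how often it was seen.
    public static class WordWithOccurrences {
        private final String word;
        private int occurrences;

        public WordWithOccurrences(String word) {
            this.word = word;
        }

        public String getWord() { return word; }
        public int getOccurrences() { return occurrences; }
        public void increment() { occurrences++; }
    }

    // Custom comparator: highest number of occurrences first.
    public static final Comparator<WordWithOccurrences> BY_OCCURRENCES_DESC =
            new Comparator<WordWithOccurrences>() {
                @Override
                public int compare(WordWithOccurrences a, WordWithOccurrences b) {
                    return Integer.compare(b.getOccurrences(), a.getOccurrences());
                }
            };

    // Counts the given words and returns at most `limit` entries, most frequent first.
    public static List<WordWithOccurrences> top(String[] words, int limit) {
        Map<String, WordWithOccurrences> byWord = new HashMap<>();
        for (String w : words) {
            if (w.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            WordWithOccurrences entry = byWord.get(w);
            if (entry == null) {
                entry = new WordWithOccurrences(w);
                byWord.put(w, entry);
            }
            entry.increment();
        }
        List<WordWithOccurrences> sorted = new ArrayList<>(byWord.values());
        Collections.sort(sorted, BY_OCCURRENCES_DESC);
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }

    public static void main(String[] args) {
        String[] words = "the cat and the dog and the bird".split("\\s+");
        for (WordWithOccurrences w : top(words, 10)) {
            System.out.println(w.getWord() + " " + w.getOccurrences());
        }
    }
}
```

With the sample sentence, "the" (3 occurrences) is printed first and "and" (2 occurrences) second. The order of the remaining words, which are tied at one occurrence each, depends on the HashMap iteration order.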


I think the best approach is to make a class Word:

public class Word implements Comparable<Word> {
    private String value;
    private Integer count;

    public Word(String value) {
        this.value = value;
        count = 1;
    }

    public String getValue() {
        return value;
    }

    public Integer getCount() {
        return count;
    }

    public void incCount() {
        count++;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof Word)
            return value.equals(((Word) obj).getValue());
        else
            return false;
    }

    @Override
    public int hashCode() {
        return value.hashCode();
    }

    @Override
    public int compareTo(Word o) {
        return count.compareTo(o.getCount());
    }
}

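You can work with a HashSet because the count is kept in the bean. After you populate everything, copy the set into a list (Collections.sort only accepts a List, not a Set), sort it with Collections.sort(list, Collections.reverseOrder()) so that the highest counts come first, and take the first 10 elements.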
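The sorting step can be sketched as follows, assuming Java 7 or later. The class name `TopFromSet` and the method `topDescending` are hypothetical. The method is generic, so it should work for any element type that implements Comparable, such as the Word class above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class TopFromSet {

    // Copies the set into a list, sorts it from largest to smallest using the
    // elements' natural ordering (compareTo), and returns at most `limit` items.
    public static <T extends Comparable<? super T>> List<T> topDescending(Set<T> items, int limit) {
        List<T> sorted = new ArrayList<>(items); // Collections.sort needs a List, not a Set
        Collections.sort(sorted, Collections.reverseOrder());
        // Math.min guards against sets that hold fewer than `limit` elements.
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }
}
```

For a `Set<Word>` named `words` (an assumed variable), the call would be `TopFromSet.topDescending(words, 10)`. This is also why the `Collections.sort(wordcount, ...)` call in the question does not compile: `wordcount` is a Set there.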


Finally solved the program. Here is a working program that reads a file, counts how often each word occurs, and lists the top 10 most frequently occurring words in descending order:

import java.io.*;
import java.util.*;

public class Occurance {

public static void main(String[] args) throws IOException {         
    LinkedHashMap<String, Integer> wordcount =
            new LinkedHashMap<String, Integer>();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader in = new BufferedReader(
                                  new FileReader("c:\\test\\input1.txt"))) {
        String str;

        while ((str = in.readLine()) != null) {
            str = str.toLowerCase(); // convert to lower case
            String[] words = str.split("\\s+"); // split the line on whitespace into an array of words

            for (String word : words) {
                if (word.length() == 0) {
                    continue; // skip the empty token produced by blank lines or leading whitespace
                }

                Integer occurrences = wordcount.get(word);

                if (occurrences == null) {
                    occurrences = 1;
                } else {
                    occurrences++;
                }

                wordcount.put(word, occurrences);
            }
        }
    } catch (Exception e) {
        System.out.println(e);
    }




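    // Sort the counts in descending order.
    ArrayList<Integer> values = new ArrayList<Integer>(wordcount.values());
    Collections.sort(values, Collections.reverseOrder());

    int lastCount = -1;
    int printed = 0;

    // Walk the ten highest counts (fewer if the file has fewer distinct words).
    for (Integer count : values.subList(0, Math.min(10, values.size()))) {
        if (lastCount == count) { // this count has already been handled
            continue;
        }
        lastCount = count;

        // Print the words that have this count, stopping after ten words in total.
        for (String s : wordcount.keySet()) {
            // equals() compares the values; == on Integer objects compares references
            if (printed < 10 && wordcount.get(s).equals(count)) {
                System.out.println(s + " " + count);
                printed++;
            }
        }
    }

}

}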


Assuming your program doesn't actually work, here's a hint:

You're comparing on a per-character basis yourself, and without going through that code I bet it is wrong:

int idx1 = -1;

for (int i = 0; i < str.length(); i++) { 
  if ((!Character.isLetter(str.charAt(i))) || (i + 1 == str.length())) { 
    if (i - idx1 > 1) { 
       if (Character.isLetter(str.charAt(i))) 
         i++;
       String word = str.substring(idx1 + 1, i);
       if (wordcount.containsKey(word)) { 
          wordcount.put(word, wordcount.get(word) + 1);
       } else { 
          wordcount.put(word, 1);
       } 
     }          
     idx1 = i;
   } 
 } 

Try to use Java's built-in functionality:

  String[] words = str.split("\\s+"); //split the line on whitespace, would return an array of words

  for( String word : words ) {
    if( word.length() == 0 ) {
      continue; //for empty lines, split would return at least one element which is ""; so account for that
    }

    Integer occurences = wordcount.get(word);

    if( occurences == null) {
      occurences = 1;
    } else {
      occurences++;
    }

    wordcount.put(word, occurences);
  }


I would have a look at java.util.Comparator. You can define your own comparator which you can pass to Collections.sort(). In your case, you would sort the keys of your wordcount by their count. Finally, just take the first ten items of the sorted collection.
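As a rough sketch of that idea, assuming Java 7 or later and a `Map<String, Integer>` like the `wordcount` map above: the class name `SortKeysByCount` and the method `topKeys` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class SortKeysByCount {

    // Returns up to `limit` keys of `wordcount`, ordered by descending count.
    public static List<String> topKeys(final Map<String, Integer> wordcount, int limit) {
        List<String> keys = new ArrayList<>(wordcount.keySet());
        Collections.sort(keys, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                // Compare b to a so that larger counts sort first.
                return wordcount.get(b).compareTo(wordcount.get(a));
            }
        });
        // Math.min avoids an exception when the map has fewer than `limit` keys.
        return new ArrayList<>(keys.subList(0, Math.min(limit, keys.size())));
    }
}
```

Calling `SortKeysByCount.topKeys(wordcount, 10)` would then give the ten most frequent words. The relative order of words with equal counts is not defined when the map is a HashMap.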

If your wordcount map has too many items, though, you might need something more efficient. It is possible to do this in linear time, by keeping an ordered array of size 10 into which you insert each key, always dropping the key with the lowest count.
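One way to approximate this, assuming Java 8 or later, is a bounded min-heap. Note that this swaps in `java.util.PriorityQueue` for the hand-maintained ordered array described above. Each insertion costs O(log k), so for a fixed k of 10 the total work grows linearly with the number of distinct words. The class name `TopKHeap` and the method `topK` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopKHeap {

    // Returns the k most frequent words, most frequent first, without sorting the whole map.
    public static List<String> topK(Map<String, Integer> wordcount, int k) {
        List<String> result = new ArrayList<>();
        if (k <= 0) {
            return result; // PriorityQueue requires an initial capacity of at least 1
        }
        Comparator<Map.Entry<String, Integer>> byCount =
                (a, b) -> a.getValue().compareTo(b.getValue());
        // Min-heap: the entry with the lowest count sits at the head.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(k, byCount);
        for (Map.Entry<String, Integer> entry : wordcount.entrySet()) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll(); // evict the entry with the lowest count
            }
        }
        // Only the k survivors are sorted, highest count first.
        List<Map.Entry<String, Integer>> best = new ArrayList<>(heap);
        Collections.sort(best, Collections.reverseOrder(byCount));
        for (Map.Entry<String, Integer> entry : best) {
            result.add(entry.getKey());
        }
        return result;
    }
}
```

The map must not be modified while `topK` runs, because the heap holds the map's own entry objects.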
