开发者

More efficient or more modern? Reading in & Sorting A Text File With Java

I've been trying to upgrade my Java skills to use more of Java 5 & Java 6. I've been playing around with some programming exercises. I was asked to read in a paragraph from a text file and output a sorted (descending) list of words and output the count of each word.

My code is below.

My questions are:

  1. Is my file input routine the most respectful of JVM resources?

  2. Is it possible to cut steps out in regards to reading the file contents and getting the content into a collection that can make a sorted list of words?

  3. Am I using the Collection classes and interface the most efficient way I can?

Thanks much for any opinions. I'm just trying to have some fun and improve my programming skills.

import java.io.*;
import  java.util.*;

public class Sort
{
    public static void main(String[] args)
    {
        String   sUnsorted       = null;
        String[] saSplit         = null;

        int iCurrentWordCount    = 1;
        String currentword       = null;
        String pastword          = "";

        // Read the text file into a string
        sUnsorted = readIn("input1.txt");

        // Parse the String by white space into String array of single words
        saSplit   = sUnsorted.split("\\s+");

        // Sort the String array in descending order
        java.util.Arrays.sort(saSplit, Collections.reverseOrder());


        // 开发者_运维知识库Count the occurences of each word in the String array
        for (int i = 0; i < saSplit.length; i++ )
        {

            currentword = saSplit[i];

            // If this word was seen before, increase the count & print the
            // word to stdout
            if ( currentword.equals(pastword) )
            {
                iCurrentWordCount ++;
                System.out.println(currentword);
            }
            // Output the count of the LAST word to stdout,
            // Reset our counter
            else if (!currentword.equals(pastword))
            {

                if ( !pastword.equals("") )
                {

                    System.out.println("Word Count for " + pastword + ": " + iCurrentWordCount);

                }


                System.out.println(currentword );
                iCurrentWordCount = 1;

            }

            pastword = currentword;  
        }// end for loop

       // Print out the count for the last word processed
       System.out.println("Word Count for " + currentword + ": " + iCurrentWordCount);



    }// end funciton main()


    // Read The Input File Into A String      
    public static String readIn(String infile)
    {
        String result = " ";

        try
        {
            FileInputStream file = new FileInputStream (infile);
            DataInputStream in   = new DataInputStream (file);
            byte[] b             = new byte[ in.available() ];

            in.readFully (b);
            in.close ();

            result = new String (b, 0, b.length, "US-ASCII");

        }
        catch ( Exception e )
        {
            e.printStackTrace();
        }

        return result;
    }// end funciton readIn()

}// end class Sort()

/////////////////////////////////////////////////
//  Updated Copy 1, Based On The Useful Comments
//////////////////////////////////////////////////

import java.io.*;
import java.util.*;

public class Sort2
{
    public static void main(String[] args) throws Exception
    {
        // Scanner will tokenize on white space, like we need
        Scanner scanner               = new Scanner(new FileInputStream("input1.txt"));
        ArrayList <String> wordlist   = new  ArrayList<String>();
        String currentword            = null;   
        String pastword               = null;
        int iCurrentWordCount         = 1;       

        while (scanner.hasNext())
            wordlist.add(scanner.next() );

        // Sort in descending natural order
        Collections.sort(wordlist);
        Collections.reverse(wordlist);

        for ( String temp : wordlist )
        {
            currentword = temp;

            // If this word was seen before, increase the count & print the
            // word to stdout
            if ( currentword.equals(pastword) )
            {
                iCurrentWordCount ++;
                System.out.println(currentword);
            }
            // Output the count of the LAST word to stdout,
            // Reset our counter
            else //if (!currentword.equals(pastword))
            {
                if ( pastword != null )
                    System.out.println("Count for " + pastword + ": " +  
                                                            CurrentWordCount);   

                System.out.println(currentword );
                iCurrentWordCount = 1;    
            }

            pastword = currentword;  
        }// end for loop

        System.out.println("Count for " + currentword + ": " + iCurrentWordCount);

    }// end funciton main()


}// end class Sort2


  1. There are more idiomatic ways of reading in all the words in a file in Java. BreakIterator is a better way of reading in words from an input.

  2. Use List<String> instead of Array in almost all cases. Array isn't technically part of the Collection API and isn't as easy to replace implementations as List, Set and Map are.

  3. You should use a Map<String,AtomicInteger> to do your word counting instead of walking the Array over and over. AtomicInteger is mutable unlike Integer so you can just incrementAndGet() in a single operation that just happens to be thread safe. A SortedMap implementation would give you the words in order with their counts as well.

  4. Make as many variables, even local ones final as possible. and declare them right before you use them, not at the top where their intended scope will get lost.

  5. You should almost always use a BufferedReader or BufferedStream with an appropriate buffer size equal to a multiple of your disk block size when doing disk IO.

That said, don't concern yourself with micro optimizations until you have "correct" behavior.


  • the SortedMap type might be efficient enough memory-wise to use here in the form SortedMap<String,Integer> (especially if the word counts are likely to be under 128)
  • you can provide customer delimiters to the Scanner type for breaking streams

Depending on how you want to treat the data, you might also want to strip punctuation or go for more advanced word isolation with a break iterator - see the java.text package or the ICU project.

Also - I recommend declaring variables when you first assign them and stop assigning unwanted null values.


To elaborate, you can count words in a map like this:

void increment(Map<String, Integer> wordCountMap, String word) {
  Integer count = wordCountMap.get(word);
  wordCountMap.put(word, count == null ? 1 : ++count);
}

Due to the immutability of Integer and the behaviour of autoboxing, this might result in excessive object instantiation for large data sets. An alternative would be (as others suggest) to use a mutable int wrapper (of which AtomicInteger is a form.)


Can you use Guava for your homework assignment? Multiset handles the counting. Specifically, LinkedHashMultiset might be useful.


Some other things you might find interesting:

To read the file you could use a BufferedReader (if it's text only).

This:

for (int i = 0; i < saSplit.length; i++ ){
    currentword = saSplit[i];
    [...]
}

Could be done using a extended for-loop (the Java-foreach), like shown here.

if ( currentword.equals(pastword) ){
    [...]
} else if (!currentword.equals(pastword)) {
    [...]
}

In your case, you can simply use a single else so the condition isn't checked again (because if the words aren't the same, they can only be different).

if ( !pastword.equals("") )

I think using length is faster here:

if (!pastword.length == 0)


Input method:

Make it easier on yourself and deal directly with characters instead of bytes. For example, you could use a FileReader and possibly wrap it inside a BufferedReader. At the least, I'd suggest looking at InputStreamReader, as the implementation to change from bytes to characters is already done for you. My preference would be using Scanner.

I would prefer returning null or throwing an exception from your readIn() method. Exceptions should not be used for flow control, but, here, you're sending an important message back to the caller: the file that you provided was not valid. Which brings me to another point: consider whether you truly want to catch all exceptions, or just ones of certain types. You'll have to handle all checked exceptions, but you may want to handle them differently.

Collections:

You're really not use Collections classes, you're using an array. Your implementation seems fine, but...

There are certainly many ways of handling this problem. Your method -- sorting then comparing to last -- is O(nlogn) on average. That's certainly not bad. Look at a way of using a Map implementation (such as HashMap) to store the data you need while only traversing the text in O(n) (HashMap's get() and put() -- and presumably contains() -- methods are O(1)).

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜