
Using the Collections framework to count word pairs in a file

I am trying to write a program that counts how many times each word pair appears in a text file, which I pipe in from the MS-DOS command line. A word pair consists of two consecutive words (i.e. a word and the word that directly follows it). In the first sentence of this paragraph, the words "counts" and "how" are a word pair.

What I want the program to do is take this input:

abc def abc ghi abc def ghi jkl abc xyz abc abc abc ---

and produce this output:

abc:
abc, 2
def, 2
ghi, 1
xyz, 1

def:
abc, 1
ghi, 1

ghi:
abc, 1
jkl, 1

jkl:
abc, 1

xyz:
abc, 1

BTW: I am excluding the words "a", "the", and "and"; they should not count toward any word pair.

What is the best way to do this? Please be nice, I am new to Java. This is what I have so far:

import java.util.Scanner;
import java.util.ArrayList;
import java.util.TreeSet;
import java.util.Iterator;
import java.util.HashSet;

public class Project1
{
    public static void main(String[] args)
    {
        Scanner sc = new Scanner(System.in); 
        String word;
        String grab;
        int number;

        // ArrayList<String> a = new ArrayList<String>();
        // TreeSet<String> words = new TreeSet<String>();
        HashSet<String> uniqueWords = new HashSet<String>();

        System.out.println("project 1\n");

        while (sc.hasNext())
        {
            word = sc.next().toLowerCase();

            // Stop at the end marker before it can be stored as a word.
            if (word.equals("---"))
            {
                break;
            }

            // Skip the stop words; everything else counts as a unique word.
            if (!(word.equals("a") || word.equals("and") || word.equals("the")))
            {
                uniqueWords.add(word);
            }
        }

        System.out.println("size");
        System.out.println(uniqueWords.size());

        System.out.println("unique words");
        System.out.println(uniqueWords.size());

        System.out.println("\nbye...");
    }
}

Sorry about the formatting. It's hard to get it right in here...


What about using a Map:

Map<String, List<String>> words = new HashMap<String, List<String>>();

The keys in the map would be unique words, and the values would each be lists of words that followed that unique word. The data structure might look like:

Key    |    Value
--------------------------
abc    |    def, ghi, jkl
def    |    jkl, mno
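A minimal sketch of how such a map could be filled, assuming the input has already been split into a list of tokens. The class name FollowerLists and the method build are made up for illustration, and a TreeMap is used instead of a HashMap only so that the printed order is predictable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of the original question.
class FollowerLists {
    // Maps each word to the list of words that directly followed it.
    // Duplicates are kept, so a pair count is the number of repeats in the list.
    static Map<String, List<String>> build(List<String> tokens) {
        Map<String, List<String>> words = new TreeMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            words.computeIfAbsent(tokens.get(i), k -> new ArrayList<>())
                 .add(tokens.get(i + 1));
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("abc def abc ghi abc def".split(" "));
        // prints {abc=[def, ghi, def], def=[abc], ghi=[abc]}
        System.out.println(build(tokens));
    }
}
```

To get the counts the question asks for, you would still have to tally the repeated entries in each list, which is why the other answers store an Integer per pair instead.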


That code looks like a fragment of something which counts unique words, which isn't your problem. The structure I suggest you need is a Map whose key is a "word pair" (make a class for this) and whose value is the number of times that "word pair" appears in the input.
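A rough sketch of that idea, assuming the tokens are already in an array. WordPair and PairCounter are hypothetical names. The key class needs equals and hashCode so that two pairs with the same words land on the same map entry.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical driver class; counts how often each ordered pair occurs.
class PairCounter {
    static Map<WordPair, Integer> count(String[] tokens) {
        Map<WordPair, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            // merge() stores 1 for a new pair, otherwise adds 1 to the old count
            counts.merge(new WordPair(tokens[i], tokens[i + 1]), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<WordPair, Integer> counts = count("abc def abc ghi abc def".split(" "));
        System.out.println(counts.get(new WordPair("abc", "def"))); // prints 2
    }
}

// Hypothetical key class: an immutable, ordered pair of words.
final class WordPair {
    final String first;
    final String second;

    WordPair(String first, String second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof WordPair)) return false;
        WordPair other = (WordPair) o;
        return first.equals(other.first) && second.equals(other.second);
    }

    @Override
    public int hashCode() {
        return Objects.hash(first, second);
    }
}
```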


Does it have to be Java? This is really much more straightforward in Perl.

(Also, is this a homework problem? :) )


One possible approach would be to take your uniqueWords Set and copy it into a List (to get direct access by index). You could then create a matrix of ints; think of it as a table that has all the words as both its rows and its columns. Now run through your text and, for each word, look up the positions of this word and its successor in the table and increment that cell, something like:

table[words.indexOf(currentWord)][words.indexOf(nextWord)]++;

In the end your table will contain the frequencies of every word-word pair. Also, to find further help on your problem, it might help to search for bigrams, which is the common name for this problem.
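The table approach could be sketched like this, assuming a small vocabulary. BigramTable is a hypothetical name, and the word list is built from a TreeSet here only to get a stable index order. Note that indexOf is a linear search and the table grows with the square of the number of unique words, so this is mainly suitable for short inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical class illustrating the matrix idea.
class BigramTable {
    // table[i][j] = number of times words.get(j) directly followed words.get(i)
    static int[][] count(List<String> words, List<String> tokens) {
        int[][] table = new int[words.size()][words.size()];
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String currentWord = tokens.get(i);
            String nextWord = tokens.get(i + 1);
            table[words.indexOf(currentWord)][words.indexOf(nextWord)]++;
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("abc def abc ghi abc def".split(" "));
        // Sorted unique words: [abc, def, ghi]
        List<String> words = new ArrayList<>(new TreeSet<>(tokens));
        int[][] table = count(words, tokens);
        System.out.println(table[words.indexOf("abc")][words.indexOf("def")]); // prints 2
    }
}
```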


Various hints:

  • You could read a file in directly by using

    Scanner sc = new Scanner(new File("file.name"));

  • You could put your so-called "stop words", e.g. "a", "and", "the", into a Set, such as a java.util.HashSet, and then test each word with something as simple as

    if (stopWords.contains(word)) ...

  • For the data structure: This is fairly sophisticated for a "project 1"! Given pairs of words in variables called first and second, I guess what I would use is a HashMap keyed on words in first, and containing as values a second HashMap keyed on words in second. The values of the second hashmap would be the counts for that pair of words, stored as Integer values.

  • You need to watch out for the corner case where you're seeing a second word for the first time; in that case, you need to store in the second hashmap your second word and Integer.valueOf(1). Otherwise, you need to replace the value with an Integer that's 1 bigger than the previous one.

  • There's a way you can "cheat" a little and dramatically simplify your data structure: If you "glue" your first and second words together using a separator character, e.g.

    String key = first + "_" + second;

then you have a key that contains both words, and you only need a single hashmap to store keys and counts in. This only works if the separator never appears inside a word. It also makes for a little work later on, when you'll have to have a collection of first words (hint: you can store those in a Set as you're processing the input) and split those keys up again (hint: use key.split("_")).
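A small sketch of the "glued key" variant, under the assumption that "_" never occurs inside a word. GluedKeys and SEP are hypothetical names, and a TreeMap is used so the keys come out sorted.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class illustrating the single-map, glued-key idea.
class GluedKeys {
    static final String SEP = "_"; // assumption: never appears inside a word

    static Map<String, Integer> count(String[] tokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            String key = tokens[i] + SEP + tokens[i + 1];
            counts.merge(key, 1, Integer::sum); // 1 for a new key, else old count + 1
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count("abc def abc ghi abc def".split(" "));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String[] pair = e.getKey().split(SEP); // pair[0] = first, pair[1] = second
            System.out.println(pair[0] + " -> " + pair[1] + ", " + e.getValue());
        }
    }
}
```

Because the keys sort as whole strings, all pairs sharing a first word end up next to each other, which makes the grouped output from the question fairly easy to print.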

If you want your words to be automatically sorted in ascending order, you'll probably do well to use TreeMap rather than HashMap.
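Putting the hints above together, one possible sketch uses a TreeMap of TreeMaps. The class name WordPairs is made up, and two behaviours are assumptions rather than anything stated in the question: stop words are simply skipped (so the words on either side of a stop word form a pair), and "---" ends the input.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical class; a sketch, not a definitive solution to the assignment.
class WordPairs {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "and", "the"));

    // Outer key: first word. Inner map: second word -> number of occurrences.
    static Map<String, Map<String, Integer>> count(Scanner sc) {
        Map<String, Map<String, Integer>> pairs = new TreeMap<>();
        String first = null;
        while (sc.hasNext()) {
            String second = sc.next().toLowerCase();
            if (second.equals("---")) break;           // end-of-input marker
            if (STOP_WORDS.contains(second)) continue; // assumption: skip stop words
            if (first != null) {
                pairs.computeIfAbsent(first, k -> new TreeMap<>())
                     .merge(second, 1, Integer::sum);
            }
            first = second;
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> pairs = count(new Scanner(System.in));
        for (Map.Entry<String, Map<String, Integer>> outer : pairs.entrySet()) {
            System.out.println(outer.getKey() + ":");
            for (Map.Entry<String, Integer> inner : outer.getValue().entrySet()) {
                System.out.println(inner.getKey() + ", " + inner.getValue());
            }
            System.out.println();
        }
    }
}
```

The merge call handles the corner case from the hints: it stores 1 the first time a second word is seen and adds 1 on every later occurrence.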
