Picking the right data structure
I'm trying to teach myself java. I'm trying to write a program that takes a string with no spaces and separates it into words.
My plan of attack was to partition the dictionary based on word length, then walk through the string finding possible variations.
I ran into a problem with making my dictionary. I've read up on the various collections, and I thought that an array (of length 20 or so) holding HashS开发者_运维技巧ets would work best for me, but I can't figure out how to declare it. I think an array would be good because the index would represent the length, then a HashSet would be good because I could store the words as keys for fast look ups.
This is something that I could do in seconds in the scripting languages I'm most comfortable with, but I've spent about 5 hours reading up and trying to figure it out in Java. Historically speaking, this is evidence that I'm doing something fundamentally wrong. Can someone with more java acumen help get me going?
I don't see why you'd need an array of hashsets. Here's what I invision:
Set<String> dictionary = new HashSet<String>();
dictionary.add("One");
dictionary.add("Two");
dictionary.add("Three");
dictionary.add("Four");
And here is how I would use it. Note: don't read below unless you want an actual answer to the breaking-into-words problem. It may diminish how much learning you get. So only read it if you're okay with it being spoiled.
List<String> split(String sentence) {
List<String> words = new LinkedList<String>();
String word = ""; // StringBuilder actually is not orders faster in
// this case or I would advocate using it...
for(int i = 0; i < sentence.length(); i++) {
word += sentence.charAt(i); // creates a new String anyway, so StringBuilder
// is far less powerful
if(dictionary.contains(word) {
words.add(word);
word = "";
}
}
return words;
}
Some concerns:
Let us assume your sentences and words are all in lowercase, to avoid case sensitivity. Let us also assume your dictionary contains every common English word.
dictionary.add("this");
dictionary.add("is");
dictionary.add("a");
dictionary.add("test");
And run "thisisatest" and it will split it properly.
Now, keep in mind there are other words.
dictionary.add("i");
dictionary.add("sat");
dictionary.add("est");
These are all valid words. Running it will give you
"this" "i" "sat" "est"
In fact, by this logic, EVERY word that starts with i
or a
will end up getting missed. And that's bad. Especially for words like "apple"
You will get a
as the first word, then continue searching for "pple"
and words that start with "pple"
. That's going to cause a lot of problems!
Even if you can get around this problem, you're going to run into problems where the words are always valid.
Consider "thetreescare". Is it "the" "tree" "scare" or "the" "trees" "care". You are unable to make the differentiation - ever!
So the problem you have picked is a dousie, for sure!
If your only question is the syntax, then to create an array of 20 HashSets, the syntax would be:
HashSet[] mySets = new HashSet[20];
HashSet<String>[] mySets = new HashSet[20];
You probably want something like:
HashSet[] dictionary = new HashSet[20];
// Initialize all sets.
for (int i=0; i<dictionary.length; i++)
{
dictionary[i] = new HashSet<String>();
}
for (String word: words) // words is array or list with all possible words
{
dictionary[word.length()].add(word);
}
精彩评论