Prefix tree problem
I need to manage nearly 1,300,000 words (some groups of words are similar). I'm building something like a small vocabulary, where each word has its description. Fast search through the vocabulary is needed, so I decided to use a prefix tree. First the prefix tree has to be built up (a slow process, I know), and after that fast lookups through the vocabulary can be organised.
But my problem is that the prefix tree builds extremely slowly (the first 300,000 words build up fast, but building the tail is very, very slow, so slow that I couldn't wait until the tree was built!).
Here is my prefix tree class:
public class InverseVocabularyTree implements Serializable {

    private HashMap<Character, InverseVocabularyTree> childs;
    private String description;
    private String[] removableBasicParts; // basic parts stored at this node, collected by the getters below

    public InverseVocabularyTree() {
        childs = new HashMap<Character, InverseVocabularyTree>();
    }

    public void addWord(String word, String description) {
        InverseVocabularyTree tr = this;
        InverseVocabularyTree chld = this;
        char[] letters = word.toCharArray();
        // walk the word from its last letter to its first (inverse tree)
        for (int i = word.length() - 1; i >= 0; i--) {
            if (!tr.childs.containsKey(letters[i])) {
                for (int j = i; j >= 0; j--) { // a small optimisation..
                    chld = new InverseVocabularyTree();
                    tr.childs.put(letters[j], chld);
                    tr = chld;
                }
                break;
            } else {
                tr = tr.childs.get(letters[i]);
            }
        }
        tr.description = description;
    }

    public HashMap<Character, InverseVocabularyTree> getChilds() {
        return childs;
    }

    public String[] getRemovableBasicParts() {
        return removableBasicParts;
    }

    public LinkedList<String[]> getAllRemovableBasicParts() {
        LinkedList<String[]> ret = new LinkedList<String[]>();
        if (removableBasicParts != null)
            ret.add(removableBasicParts);
        if (childs.keySet().isEmpty())
            return ret;
        for (char c : childs.keySet())
            ret.addAll(childs.get(c).getAllRemovableBasicParts());
        return ret;
    }
}
So, does anybody have ideas or advice on how to optimise in this situation?
I would just use a NavigableMap (or a similar Set if you don't need a value). Say you need to search for words starting with "abc"; you just need to do
NavigableMap<String, Boolean> wordmap = new TreeMap<String, Boolean>();
Random random = new Random(1);
for (int i = 0; i < 10 * 1000 * 1000; i++)
    wordmap.put(Long.toString(Math.abs(random.nextLong()), 36).substring(1), true);

String prefix = "abcd";
for (String word : wordmap.subMap(prefix, prefix + "\uffff").keySet()) {
    System.out.println(word + " starts with " + prefix);
}
// or
for (String word : wordmap.tailMap(prefix).keySet()) {
    if (!word.startsWith(prefix)) break;
    System.out.println(word + " starts with " + prefix);
}
This uses 1GB on my machine for 10 million entries and prints
abcd0krpbk1 starts with abcd
abcd7xi05pe starts with abcd
abcdlw4pwfl starts with abcd
EDIT: Based on the feedback, I would suggest something like the following approach.
// keys stored in reverse order of the original string
NavigableMap<String, Boolean> wordmap = new TreeMap<String, Boolean>();

String search = "dcba";
// retains the order keys were added
Map<String, Boolean> results = new LinkedHashMap<String, Boolean>();
for (int i = search.length(); i >= 1; i--) {
    String s = search.substring(0, i);
    results.putAll(wordmap.subMap(s, s + '\uFFFF')); // ignores duplicates
}
The results will combine all the searches in the order they were added, from most specific to least specific.
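As a small usage sketch of that scheme (assuming every word is inserted under its reversed form, so a suffix lookup becomes a prefix lookup on the reversed key):

NavigableMap<String, Boolean> wordmap = new TreeMap<String, Boolean>();
wordmap.put(new StringBuilder("abcd").reverse().toString(), true); // stored as "dcba"

String suffix = "cd";
String search = new StringBuilder(suffix).reverse().toString(); // "dc"
for (String key : wordmap.subMap(search, search + '\uFFFF').keySet()) {
    System.out.println(new StringBuilder(key).reverse() + " ends with " + suffix);
}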
Assuming that the problem is that after a few hundred thousand words your tree is getting too tall, you could try to use certain commonly encountered bi-grams or tri-grams instead of single letters for some nodes, in order to make it a bit shorter. For example, if you have a lot of words ending in "ing", then instead of having one node for g which has a child for n which has a child for i, you could create a single node for ing. Of course, how well this works depends on your vocabulary, and you'll probably need to do some analysis to find appropriate bi- or tri-grams to use.
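As a rough sketch (not the poster's class, and leaving out how the fragments are chosen), a node keyed on string fragments instead of single characters could look like this:

import java.io.Serializable;
import java.util.HashMap;
import java.util.List;

// Sketch: a node keyed on string fragments ("ing", "ed", ...) instead of
// single characters, so common endings collapse into one level of the tree.
public class FragmentTree implements Serializable {
    private final HashMap<String, FragmentTree> childs = new HashMap<String, FragmentTree>();
    private String description;

    // 'fragments' is the word already split into the chosen fragments and
    // reversed, e.g. "walking" -> ["ing", "walk"] for an inverse tree.
    public void addWord(List<String> fragments, String description) {
        FragmentTree node = this;
        for (String f : fragments) {
            FragmentTree child = node.childs.get(f);
            if (child == null) {
                child = new FragmentTree();
                node.childs.put(f, child);
            }
            node = child;
        }
        node.description = description;
    }
}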
In general, since you say you have checked garbage collection, I think it would be useful to find out whether there is a specific tree size after which your application starts slowing down, or whether the problem is something entirely different. Having a better view of what exactly the problem is could give you new ideas on how to solve it.
You are creating at least one HashMap for every word (often more), so if you have really many different words, you run out of memory. Don't explicitly call System.gc; instead, observe your program with jconsole or a similar profiling tool.
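One way to cut the per-node overhead, shown here only as a sketch rather than a drop-in replacement for the class above, is to create a node's child map lazily, so leaf nodes never allocate an empty HashMap:

import java.io.Serializable;
import java.util.HashMap;

// Sketch: the child map is allocated on first use, so the (many) leaf
// nodes carry no HashMap of their own.
public class LazyNode implements Serializable {
    private HashMap<Character, LazyNode> childs; // stays null for leaves
    private String description;

    public LazyNode getOrCreateChild(char c) {
        if (childs == null)
            childs = new HashMap<Character, LazyNode>(4); // small initial capacity
        LazyNode child = childs.get(c);
        if (child == null) {
            child = new LazyNode();
            childs.put(c, child);
        }
        return child;
    }
}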
I suppose that after your first 300,000 words the memory is simply nearly full, and your program spends most of its time trying to get more space. If this is the case, try to give your program more memory (with the -Xmx option).
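For example (the heap size and main class name are just placeholders):

java -Xmx2g MyVocabularyBuilder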