Best way to store and retrieve a DAWG data structure for fast loading
I have a 500k+ wordlist that I loaded into a DAWG data structure. My app is for mobile phones. I of course don't want to repeat all the conversion steps to load this wordlist into a DAWG every time, since it would take too much storage space to keep the wordlist on the phone and too much time to load it into a DAWG each time. So, I am looking for a way to store the data in my DAWG to a file or DB in a format that will both conserve space and allow me to quickly load it back into my DAWG data structure.
I received one suggestion that I could store each node in a SQLite DB, but I am not sure how exactly that would work, and if I did that, how I would retrieve it quickly. I certainly wouldn't want to run lots of queries. Would some other type of storage method be better? I also received suggestions to create a serialized file or to store it as a bitmap.
You can basically do a memory dump, just use offsets instead of pointers (in Java terms, put all nodes in an array, and use array index to refer to a node).
500k doesn't seem like an amount that would be problematic for modern phones, especially since a DAWG is quite space-efficient already. If you mmap the file, you will be able to work with the data structure even if it doesn't fit in memory.
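A minimal sketch of that idea in Java, assuming a hypothetical Node class with a letter, an end-of-word flag, and a child list (the field names and record layout are illustrative, not from the original post): each node gets an array index, and child pointers are written as those indices.

```java
import java.io.*;
import java.util.*;

// Sketch: dump a DAWG to a flat file, using array indices instead of pointers.
// The Node class here is hypothetical; adapt it to your own DAWG implementation.
class DawgDump {
    static class Node {
        char letter;
        boolean endOfWord;
        List<Node> children = new ArrayList<>();
    }

    static void dump(Node root, File out) throws IOException {
        // Assign each distinct node an index in breadth-first order.
        Map<Node, Integer> index = new LinkedHashMap<>();
        Deque<Node> queue = new ArrayDeque<>();
        index.put(root, 0);
        queue.add(root);
        while (!queue.isEmpty()) {
            Node n = queue.poll();
            for (Node c : n.children) {
                if (!index.containsKey(c)) {   // shared suffix nodes are recorded once
                    index.put(c, index.size());
                    queue.add(c);
                }
            }
        }
        // One record per node: letter, end-of-word flag, child count, child indices.
        try (DataOutputStream o = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(out)))) {
            o.writeInt(index.size());
            for (Node n : index.keySet()) {
                o.writeChar(n.letter);
                o.writeBoolean(n.endOfWord);
                o.writeInt(n.children.size());
                for (Node c : n.children) {
                    o.writeInt(index.get(c));
                }
            }
        }
    }
}
```

Loading is the reverse: read the records back into parallel arrays (letters, flags, child index lists) and navigate the DAWG by index instead of by object reference, so nothing has to be rebuilt.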
Have you tried to reduce the wordlist? Are you saving only the word stem, if that is possible for your application?
On the other hand: you should never rebuild the data structure, because the wordlist is constant. Try to use a memory dump as suggested. Use mmap for the file, Java serialization, or pickling techniques to load a ready-made data structure into memory.
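A minimal sketch of the memory-mapping idea, assuming a Java platform where java.nio is available (older J2ME phones do not have it) and assuming a pre-built flat node file like the one from the previous answer; the accessor is illustrative:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: memory-map a pre-built DAWG file so nothing has to be parsed or
// rebuilt at startup; the OS pages the data in on demand.
class DawgMmap {
    private final MappedByteBuffer buf;

    DawgMmap(String path) throws Exception {
        try (RandomAccessFile f = new RandomAccessFile(path, "r");
             FileChannel ch = f.getChannel()) {
            // The mapping stays valid after the channel is closed.
            buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    // Read an int field straight out of the mapped file at a byte offset.
    int intAt(int byteOffset) {
        return buf.getInt(byteOffset);
    }
}
```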
I guess you are using a DAWG for fast lookup of a word in a dictionary. A DAWG has O(LEN) search complexity.
Many years ago I developed a J2ME app and faced the same problem. But back then, phones definitely could not provide enough RAM to store 500k+ strings. The solution I used is the following:
- Read all the words, sort them, and write them to a file line by line. For each word, precompute skipBytes - the number of bytes that precede that word in the file. Computing skipBytes is trivial; the pseudocode is skipBytes[0] = 0; for i = 1 to n-1: skipBytes[i] = skipBytes[i-1] + byteLength(words[i-1]) + lineSeparatorLength.
- When the app starts, read the 500k skipBytes values into an int array. It is much smaller than 500k strings.
- Searching for a word in the dictionary is a binary search. Imagine you are performing it on a sorted array, but instead of reading array[i] you do something like RandomAccessFile.seek(skipBytes[i]) followed by a read. Look up Java's RandomAccessFile; the pseudocode above is rough and only shows the direction (a fuller sketch follows this list).
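A rough Java sketch of that approach, assuming the words are written one per line in a sorted file and that skipBytes[i] is the byte offset of word i; the class and method names are illustrative:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: binary search over a sorted word file without loading the words.
// Only the skipBytes offsets live in RAM; each probe seeks into the file.
class SortedWordFile {
    private final RandomAccessFile file;
    private final int[] skipBytes;           // byte offset of each word in the file

    SortedWordFile(RandomAccessFile file, int[] skipBytes) {
        this.file = file;
        this.skipBytes = skipBytes;
    }

    // Precompute offsets for words written one per line with a single '\n'.
    static int[] computeSkipBytes(String[] sortedWords) {
        int[] skip = new int[sortedWords.length];
        for (int i = 1; i < sortedWords.length; i++) {
            skip[i] = skip[i - 1] + sortedWords[i - 1].getBytes().length + 1;
        }
        return skip;
    }

    // Read the word starting at skipBytes[i] (assumes an ASCII wordlist).
    private String wordAt(int i) throws IOException {
        file.seek(skipBytes[i]);
        String line = file.readLine();
        return line == null ? "" : line;
    }

    // Standard binary search: O(LOG(N)) seeks, O(LEN) per string comparison.
    boolean contains(String word) throws IOException {
        int lo = 0, hi = skipBytes.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = wordAt(mid).compareTo(word);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }
}
```

With int offsets, 500k entries cost about 2 MB of RAM for the skipBytes array, far less than keeping the strings themselves in memory.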
Complexity is O(LEN * LOG(N)): the LOG(N) comes from the binary search, and comparing strings is linear in their length. LOG(500000) ≈ 19, and LEN (the average word length) is at worst around 50 (a generous upper bound), so a search is still very fast, only ~1000 operations; it completes in microseconds. The advantage is small memory usage.
I should mention that in the case of a web app, where many users perform searches, LOG(N) becomes important, but if your app serves only one person, LOG(500000) does not change much, as long as the search is not performed inside a loop.