How can I improve my Trie implementation in terms of initialization?
I'm trying to read in from a huge list of words and store them in a way that allows me to make quick retrievals later on. I first thought of using a trie and I'll admit my implementation is naive, it's basically nest开发者_C百科ed hash tables with each key being a different letter. Right now it takes forever to insert a word into the trie (running this program takes 20+ seconds), and I was wondering if anyone had any ideas as to what I could do to improve my insertion? This isn't for homework.
import string
import time
class Trie:
def __init__(self):
self.root = TrieNode()
def insert_word(self, word):
current_node = self.root
for letter in word:
trie_node = current_node.get_node(letter)
current_node = trie_node
class TrieNode:
def __init__(self):
self.data = {}
def get_node(self, letter):
if letter in self.data:
return self.data[letter]
else:
new_trie_node = TrieNode()
self.data[letter] = new_trie_node
return new_trie_node
def main():
start_time = time.time()
trie = Trie()
with open('/usr/share/dict/words', 'r') as dictionary:
word_list = dictionary.read()
word_list = word_list.split("\n")
for word in word_list:
trie.insert_word(word.lower())
print time.time() - start_time, "seconds"
if __name__ == "__main__":
main()
It is utterly pointless working on speeding up your trie initialisation before you have considered whether your search facility is working or not.
In the code that @unutbu referred to, why do you imagine it is mucking about with {'end':False}
and pt['end']=True
?
Here is some test data for you:
words_to_insert = ['foo', 'foobar']
queries_expecting_true = words_to_insert
queries_expecting_false = "fo foe foob food foobare".split()
And here's another thought: You give no indication that you want anything more than the ability to determine whether a query word is present or not. If that is correct, you should consider benchmarking your DIY trie against a built-in set
. Criteria: load speed (consider doing this from a pickle), query speed, and memory usage.
If you do want more retrieved than a bool
, then substitute dict
for set
and re-read this answer.
If you do want to search for words in an input string, then you could consider the code referenced by @unutbu, with bugs fixed and some speedups in the find
function (evaluate len(input)
only once, use xrange
instead of range
(Python 2.x)) and the unnecessary TERMINAL: False
entries removed:
TERMINAL = None # Marks the end of a word
def build(words, trie=None): # bugs fixed
if trie is None:
trie = {}
for word in words:
if not word: continue # bug fixed
pt = trie # bug fixed
for ch in word:
pt = pt.setdefault(ch, {})
pt[TERMINAL] = True
return trie
def find(input, trie):
len_input = len(input)
results = []
for i in xrange(len_input):
pt = trie
for j in xrange(i, len_input + 1):
if TERMINAL in pt:
results.append(input[i:j])
if j >= len_input or input[j] not in pt:
break
pt = pt[input[j]]
return results
or you could look at Danny Yoo's fast implementation of the Aho-Corasick algorithm.
There is an alternate implementation of Trie here.
Compare Trie.insert_word
to build
:
def build(words,trie={'end':False}):
'''
build builds a trie in O(M*L) time, where
M = len(words)
L = max(map(len,words))
'''
for word in words:
pt=trie
for letter in word:
pt=pt.setdefault(letter, {'end':False})
pt['end']=True
return trie
With Trie
, for each letter
in word
, insert_word
calls current_node.get_node(letter)
. This method has an if
and else
block, and must instantiate a new TrieNode
whenever the else
block is reached, and a new key-value pair is then inserted into the self.data
dict.
With build
, the trie itself is just a dict. For each letter
in word
, there is simply one call to pt.setdefault(...)
. dict
methods are implemented in C and are faster than implementing similar code in Python.
timeit
shows about a 2x speed difference (in favor of build
):
def alt_main():
with open('/usr/share/dict/words', 'r') as dictionary:
word_list = dictionary.read()
word_list = word_list.split("\n")
return build(word_list)
% python -mtimeit -s'import test' 'test.main()'
10 loops, best of 3: 1.16 sec per loop
% python -mtimeit -s'import test' 'test.alt_main()'
10 loops, best of 3: 571 msec per loop
精彩评论