How can I improve my Trie implementation in terms of initialization?

2023-04-09 16:57 问答作者：

I'm trying to read in from a huge list of words and store them in a way that allows me to make quick retrievals later on. I first thought of using a trie and I'll admit my implementation is naive, it's basically nest开发者_C百科ed hash tables with each key being a different letter. Right now it takes forever to insert a word into the trie (running this program takes 20+ seconds), and I was wondering if anyone had any ideas as to what I could do to improve my insertion? This isn't for homework.

import string
import time

class Trie:

    def __init__(self):
        self.root = TrieNode()

    def insert_word(self, word):
        current_node = self.root
        for letter in word:
            trie_node = current_node.get_node(letter)
            current_node = trie_node

class TrieNode:

    def __init__(self):
        self.data = {}

    def get_node(self, letter):
        if letter in self.data:
            return self.data[letter]
        else:
            new_trie_node = TrieNode()
            self.data[letter] = new_trie_node
            return new_trie_node

def main():
    start_time = time.time()
    trie = Trie()

    with open('/usr/share/dict/words', 'r') as dictionary:
        word_list = dictionary.read()
    word_list = word_list.split("\n")

    for word in word_list:
        trie.insert_word(word.lower())

    print time.time() - start_time, "seconds"


if __name__ == "__main__":
    main()

It is utterly pointless working on speeding up your trie initialisation before you have considered whether your search facility is working or not.

In the code that @unutbu referred to, why do you imagine it is mucking about with {'end':False} and pt['end']=True ?

Here is some test data for you:

words_to_insert = ['foo', 'foobar']
queries_expecting_true = words_to_insert
queries_expecting_false = "fo foe foob food foobare".split()

And here's another thought: You give no indication that you want anything more than the ability to determine whether a query word is present or not. If that is correct, you should consider benchmarking your DIY trie against a built-in set. Criteria: load speed (consider doing this from a pickle), query speed, and memory usage.

If you do want more retrieved than a bool, then substitute dict for set and re-read this answer.

If you do want to search for words in an input string, then you could consider the code referenced by @unutbu, with bugs fixed and some speedups in the find function (evaluate len(input) only once, use xrange instead of range (Python 2.x)) and the unnecessary TERMINAL: False entries removed:

TERMINAL = None # Marks the end of a word

def build(words, trie=None): # bugs fixed
    if trie is None:
        trie = {}
    for word in words:
        if not word: continue # bug fixed
        pt = trie # bug fixed
        for ch in word:
            pt = pt.setdefault(ch, {})
        pt[TERMINAL] = True
    return trie

def find(input, trie):
    len_input = len(input)
    results = []
    for i in xrange(len_input):
        pt = trie
        for j in xrange(i, len_input + 1):
            if TERMINAL in pt:
                results.append(input[i:j])
            if j >= len_input or input[j] not in pt:
                break
            pt = pt[input[j]]
    return results

or you could look at Danny Yoo's fast implementation of the Aho-Corasick algorithm.

There is an alternate implementation of Trie here.

Compare Trie.insert_word to build:

def build(words,trie={'end':False}):
    '''
    build builds a trie in O(M*L) time, where
        M = len(words)
        L = max(map(len,words))
    '''
    for word in words:
        pt=trie
        for letter in word:
            pt=pt.setdefault(letter, {'end':False})
        pt['end']=True
    return trie

With Trie, for each letter in word, insert_word calls current_node.get_node(letter). This method has an if and else block, and must instantiate a new TrieNode whenever the else block is reached, and a new key-value pair is then inserted into the self.data dict.

With build, the trie itself is just a dict. For each letter in word, there is simply one call to pt.setdefault(...). dict methods are implemented in C and are faster than implementing similar code in Python.

timeit shows about a 2x speed difference (in favor of build):

def alt_main():
    with open('/usr/share/dict/words', 'r') as dictionary:
        word_list = dictionary.read()
    word_list = word_list.split("\n")
    return build(word_list)


% python -mtimeit -s'import test' 'test.main()'
10 loops, best of 3: 1.16 sec per loop

% python -mtimeit -s'import test' 'test.alt_main()'
10 loops, best of 3: 571 msec per loop

继续阅读：data-structures optimization python trie

How can I improve my Trie implementation in terms of initialization?

更多精彩内容

精彩评论

最新问答

北医三院三代试管养囊一次费用是多少？贵不贵？？

下周全市中小学校恢复线下教学，如何让孩子收心准备开学？？

飞利浦液晶电视,不小心按了TV,现实无信号,要怎么样才能返回电视...？

如何治疗输卵管阻塞？

理光短焦家用投影仪pjk366蓝光3d超清家用会议教学家用投影机如何样？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？