Accessing a Dynamically Generated Nested Dictionary

2023-02-20 05:45 问答作者：

The goal is to be able to compare words in a document to the words in a set of documents as fast as possible (create a term-document matrix). If possible can this be done (and will it be fast) by using Lucene?

My thought (if done by me) is to create a tree of terms in each document and then combine the trees together to make the set. To create the trees, I would use nested dictionaries with each dictionary key being a character. Each position in the term would be a different level in the heirarchy

Positions

For example, using a sample string "This is a test" the tree would look like

t
 h
  i
   s
 e
  s
   t
i
 s
a

Notice the 't' in the first level is there only once. The first dictionary would contain the keys {'t','i','a'}. There would be three second level dictionaries containing the keys {'h'}{'e'}{'s'}.

This should make look up extremly fast. The max number of steps in a loop would be the number of characters in the longest word. The part that is making my brain fold in on itself, is ho开发者_运维问答w do I dynamically build a dictionary like this, specifically accessing the correct level

So far I have something to the effect of

def addTerm(self, term):
   current_level = 0;
   for character in list(term):
      character = character.lower()
      if re.match("[a-z]",character):
         self.tree[character] = {}
         current_level += 1

I can see a few problems with your current implementation. How do you mark if a node in the trie is a word? A better implementation would be to initialize tree to something like tree = [{}, None] where None indicates if the current node is the end of a word.

Your addTerm method could then be something like:

def addTerm(self, term):
   node = self.tree
   for c in term:
      c = c.lower()
      if re.match("[a-z]",c):
         node = node[0].setdefault(c,[{},None])
   node[1] = term

You could set node[1] to True if you don't care about what word is at the node.

Searching if a word is in the trie would be something like

def findTerm(self, term):
    node = self.tree
    for c in term:
        c = c.lower()
        if re.match("[a-z]",c):
            if c in node[0]:
                node = node[0][c]
            else:
                return False
    return node[1] != None

I read an article about doing a very similar thing recently by John Resig. The implementation is Javascript, but using the idea and translating the code should be quite easy to do. At least it will give you some corner-case issues to consider in your implementation.

http://ejohn.org/blog/javascript-trie-performance-analysis/

继续阅读：matrix nested python

Accessing a Dynamically Generated Nested Dictionary

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？