Accessing a Dynamically Generated Nested Dictionary
The goal is to be able to compare words in a document to the words in a set of documents as fast as possible (create a term-document matrix). If possible can this be done (and will it be fast) by using Lucene?
My thought (if done by me) is to create a tree of terms in each document and then combine the trees together to make the set. To create the trees, I would use nested dictionaries with each dictionary key being a character. Each position in the term would be a different level in the heirarchy
Positions
1
2
3
4
5
6
For example, using a sample string "This is a test" the tree would look like
t
h
i
s
e
s
t
i
s
a
Notice the 't' in the first level is there only once. The first dictionary would contain the keys {'t','i','a'}. There would be three second level dictionaries containing the keys {'h'}{'e'}{'s'}.
This should make look up extremly fast. The max number of steps in a loop would be the number of characters in the longest word. The part that is making my brain fold in on itself, is ho开发者_运维问答w do I dynamically build a dictionary like this, specifically accessing the correct level
So far I have something to the effect of
def addTerm(self, term):
current_level = 0;
for character in list(term):
character = character.lower()
if re.match("[a-z]",character):
self.tree[character] = {}
current_level += 1
I can see a few problems with your current implementation. How do you mark if a node in the trie is a word? A better implementation would be to initialize tree to something like tree = [{}, None]
where None indicates if the current node is the end of a word.
Your addTerm method could then be something like:
def addTerm(self, term):
node = self.tree
for c in term:
c = c.lower()
if re.match("[a-z]",c):
node = node[0].setdefault(c,[{},None])
node[1] = term
You could set node[1]
to True if you don't care about what word is at the node.
Searching if a word is in the trie would be something like
def findTerm(self, term):
node = self.tree
for c in term:
c = c.lower()
if re.match("[a-z]",c):
if c in node[0]:
node = node[0][c]
else:
return False
return node[1] != None
I read an article about doing a very similar thing recently by John Resig. The implementation is Javascript, but using the idea and translating the code should be quite easy to do. At least it will give you some corner-case issues to consider in your implementation.
http://ejohn.org/blog/javascript-trie-performance-analysis/
精彩评论