Python- find all subwords that can be found inside of a word
Ultimately, I want to find out which word in the English dictionary contains the most subwords that are at least three letters long. I wrote this algorithm, but it's too slow to be useful. I'm wondering how I can optimize it:
def subWords(word):
    return set((word[0:i] for i in range(2, len(word)+1))) #returns all subWords of length 2 or greater

def checkDict(wordList, dictList):
    return set((word for word in wordList if word in dictList))

def main():
    dictList = [i.strip() for i in open('wordlist.txt').readlines()]
    allwords = list()
    maximum = (0, list())
    for dictWords in dictList:
        for i in range(len(dictWords)):
            for a in checkDict(subWords(dictWords[i: len(dictWords) + 1]), dictList):
                allwords.append(a)
        if len(allwords) > maximum[0]:
            maximum = (len(allwords), allwords)
            print maximum
        allwords = list()
    print maximum

main()
The major weakness of your algorithm is that, for every subword, you compare it against every other word in the dictionary. You don't need to do that: if your word begins with 'a', there's no point checking it against words that begin with 'b'. If the next letter is 'c', there's no point comparing it against words whose next letter is 'd'. The question then becomes: "How do I implement this idea efficiently?"
To do this, we can create a tree to represent all of the words in the dictionary. We construct this tree by taking each word in the dictionary and extending the tree with it, and shading in the last node.
When we want to test if a subword is in this tree, we just go through that word letter by letter and use those letters to determine where to go next in the tree (starting at the top). If we find that we have nowhere to go, or that we land on a non-shaded tree node after going through the whole subword, then it's not a word. Otherwise, it is a word if we land on a shaded node. The effect of this is that we can search the entire dictionary at once, rather than one word at a time. The cost of this is, of course, a bit of set-up at the start, but that's not a great price to pay if you have a lot of words in the dictionary.
Well that's all pretty fantastic! Let's try implementing it:
class Node:
    def __init__(self, parent, valid_subword):
        self.parent = parent
        self.valid_subword = valid_subword
        self.children = {}

    #Extend the tree with a new node
    def extend(self, transition, makes_valid_word):
        if transition in self.children:
            if makes_valid_word:
                self.children[transition].valid_subword = True
        else:
            self.children[transition] = Node(self, makes_valid_word)
        return self.children[transition]

def generateTree(allwords):
    tree = Node(None, False)
    for word in allwords:
        current_node = tree
        for i in range(len(word)):
            #The last letter of each word marks a shaded (valid) node
            current_node = current_node.extend(word[i], i == len(word) - 1)
    return tree
def checkDict(word, tree):
    current_node = tree
    for letter in word:
        try:
            current_node = current_node.children[letter]
        except KeyError:
            return False
    return current_node.valid_subword
Then, later on:

tree = generateTree(dictList)
for word in allWords:  #the words you want to test
    for subword in subWords(word):
        checkDict(subword, tree)
        #Code to keep track of the number of words found, like you already have
This algorithm allows you to check whether or not a word is in your dictionary in O(m) time, where m is the length of the word being checked (bounded by the length of the longest word in the dictionary). Notice that this remains roughly constant for a dictionary containing an arbitrary number of words. Your original algorithm was O(n) per check, where n is the number of words in the dictionary.
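A quick sanity check of the trie, using the functions above on a toy word list (expected results shown in comments):

tree = generateTree(['cat', 'cats', 'at'])
print checkDict('cat', tree)   # True: 'cat' ends on a shaded node
print checkDict('ca', tree)    # False: 'ca' ends on an unshaded internal node
print checkDict('dog', tree)   # False: there is no 'd' branch at the root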
1) Style and organization: it makes more sense to have a single function that generates all the subwords of a word.
2) Style: the double parentheses are not needed to use set.
3) Performance (I hope): make a set from the words you're looking up, too; then you can use the built-in set intersection check.
4) Performance (almost certainly): don't do a manual loop to find the maximum element; use the built-in max. You can compare (length, elements) tuples directly; Python compares tuples element by element from beginning to end, as if each element were a letter in a string (see the short example after this list).
5) Performance (maybe): ensure there are no 1- or 2-letter words in the dictionary to begin with, since they just get in the way.
6) Performance (sadly true): don't break everything up into functions.
7) Style: file I/O should use a with block to ensure proper cleanup of resources, and file iterators iterate over lines by default, so we can get a list of lines implicitly instead of having to call .readlines().
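As a quick illustration of point 4 (the values here are made up):

>>> max([(2, ['ab', 'cd']), (3, ['ab', 'cd', 'ef'])])
(3, ['ab', 'cd', 'ef'])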
I end up with (not properly tested, except the 'fragments' expression):
def countedSubWords(word, dictionary):
    fragments = set(
        word[i:j]
        for i in range(len(word)) for j in range(i+3, len(word)+1)
    )
    subWords = fragments.intersection(dictionary)
    return (len(subWords), subWords)

def main():
    with open('wordlist.txt') as words:
        dictionary = set(word.strip() for word in words if len(word.strip()) > 2)
        print max(countedSubWords(word, dictionary) for word in dictionary)
To explore basic Python, have a look at this function (basically a faster, polished, PEP8-happy version of what JBernardo and Karl Knechtel suggested):
def check_dict(word, dictionary):
    """Return all subwords of `word` that are in `dictionary`."""
    fragments = set(word[i:j]
                    for i in xrange(len(word) - 2)
                    for j in xrange(i + 3, len(word) + 1))
    return fragments & dictionary

# word_list: an iterable of dictionary words, e.g. loaded from a file
dictionary = frozenset(word for word in word_list if len(word) >= 3)
print max(((word, check_dict(word, dictionary)) for word in dictionary),
          key=lambda (word, subwords): len(subwords))  # max = the most subwords
Outputs something like:
('greatgrandmothers',
set(['and', 'rand', 'great', 'her', 'mothers', 'moth', 'mother', 'others', 'grandmothers', 'grandmother', 'ran', 'other', 'greatgrandmothers', 'greatgrandmother', 'grand', 'hers', 'the', 'eat']))
for the word list from http://www.mieliestronk.com/wordlist.html.
Now I know you're not going for performance (the code above already runs in <1s for the standard English vocabulary of 58k words). But in case you'd need to run this super fast, say in some inner loop :)
- You'd want to avoid creating copies of all the substrings inside check_dict on the heap; that's the main performance killer.
- You could do that with pointer arithmetic, representing each substring by a pair of pointer delimiters only (as opposed to a full string object).
- Use that substring representation to quickly determine whether it is part of the valid vocabulary:
  - use the trie data structure, or its memory-friendly version, the PATRICIA tree
  - build the static trie once, from your dictionary, then do fast substring lookups
  - incrementally shuffle the pointers to explore all possible substrings, increasing a hit counter for valid words
  - this way you avoid any dynamic allocations (no strings, no sets), which is blazing fast!
- All this is not very relevant in Python, because such memory management is too low-level and you'd be better off not using Python for performance-critical code anyway (still, a rough Python sketch of the pointer-walking idea follows below).
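For illustration only, here is a minimal sketch of that pointer-walking idea in plain Python; the names build_trie and count_subwords_trie and the '$' end marker are my own, not from any answer above. The two loop indices play the role of the substring pointers, so no substring copies are ever allocated:

def build_trie(dictionary):
    root = {}
    for w in dictionary:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True  # '$' marks the end of a complete word
    return root

def count_subwords_trie(word, root):
    hits = 0
    for i in range(len(word)):         # left "pointer"
        node = root
        for j in range(i, len(word)):  # right "pointer"
            node = node.get(word[j])
            if node is None:
                break                  # no dictionary word continues this prefix
            if '$' in node and j - i + 1 >= 3:
                hits += 1              # found a subword of 3+ letters
    return hits

For example, count_subwords_trie('asdf', build_trie(['asd', 'asdf'])) returns 2; only the trie itself is allocated, once, up front.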
This runs in a few seconds. "sowpods.txt" has 267627 words of 3 or more letters.
If you are using Python 2.5 or 2.6 you need to use at_least_3 = set(w for w in words if len(w)>=3) instead of the set comprehension below.
words = open("sowpods.txt").read().split()
at_least_3 = {w for w in words if len(w)>=3}

def count_subwords(word):
    counter = 0
    for i in range(len(word)-2):
        for j in range(i+3, len(word)+1):
            candidate = word[i:j]
            if candidate in at_least_3:
                counter += 1
    return counter

for row in sorted((count_subwords(w), w) for w in at_least_3):
    print row
The highest number of subwords is 26
(26, 'CORESEARCHERS')
(26, 'FOREGONENESSES')
(26, 'METAGENETICALLY')
(26, 'PREPOSSESSIONS')
(26, 'SACRAMENTALISTS')
(26, 'WHOLESOMENESSES')
Is this what you're asking, or am I missing something?
>>> words = ['a', 'asd', 'asdf', 'bla']
>>> [sum(1 for i in (a for a in words if a in b)) for b in words]
[1, 2, 3, 2]
That's the number of subwords each word contains, counting the word itself. If you don't want to count words with fewer than 3 chars, just remove them...
And, of course, it's O(n²).
Edit:
The question asks for all subwords, but the code above only counts them... If you really want the former behavior, just remove the sum(...) part and make the genexp a list comprehension...
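For example, with the same words list as above, this hypothetical variant returns the matching subwords themselves:

>>> [[a for a in words if a in b] for b in words]
[['a'], ['a', 'asd'], ['a', 'asd', 'asdf'], ['a', 'bla']]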