Python- find all subwords that can be found inside of a word

Ultimately, I want to find out which word in the English dictionary contains the most subwords that are at least three letters long. I wrote this algorithm, but it's too slow to be useful, and I'm wondering how I can optimize it.

def subWords(word):
    return set((word[0:i] for i in range(2, len(word)+1))) #returns all subWords of length 2 or greater

def checkDict(wordList, dictList):
    return set((word for word in wordList if word in dictList))

def main():
    dictList = [i.strip() for i in open('wordlist.txt').readlines()]
    allwords = list()
    maximum = (0, list())

    for dictWords in dictList:
        for i in range (len(dictWords)):
            for a in checkDict(subWords(dictWords[i: len(dictWords) + 1]), dictList):
                allwords.append(a)

        if len(allwords) > maximum[0]:
            maximum = (len(allwords), allwords)

        print maximum
        allwords = list()

    print maximum 
main()


The major weakness of your algorithm is that, for every subword, you compare it against every other word in the dictionary. You don't need to do that: if your word begins with 'a', there is no point checking it against words that begin with 'b'; if the next letter is 'c', there is no point comparing against words whose second letter is 'd'. The question then becomes: "How do I implement this idea efficiently?"

To do this, we can create a tree to represent all of the words in the dictionary. We construct this tree by taking each word in the dictionary and extending the tree with it, and shading in the last node.

[diagram: the dictionary words laid out as a tree, with the node ending each word shaded]

When we want to test if a subword is in this tree, we just go through that word letter by letter and use those letters to determine where to go next in the tree (starting at the top). If we find that we have nowhere to go, or that we land on a non-shaded tree node after going through the whole subword, then it's not a word. Otherwise, it is a word if we land on a shaded node. The effect of this is that we can search the entire dictionary at once, rather than one word at a time. The cost of this is, of course, a bit of set-up at the start, but that's not a great price to pay if you have a lot of words in the dictionary.

Well that's all pretty fantastic! Let's try implementing it:

class Node:
    def __init__( self, parent, valid_subword ):
        self.parent = parent
        self.valid_subword = valid_subword #True if the path from the root to this node spells a dictionary word
        self.children = {}

    #Extend the tree with a new node, or mark an existing node as ending a valid word
    def extend( self, transition, makes_valid_word ):
        if transition in self.children:
            if makes_valid_word:
                self.children[transition].valid_subword = True
        else:
            self.children[transition] = Node( self, makes_valid_word )
        return self.children[transition]

def generateTree( allwords ):
    tree = Node( None, False )
    for word in allwords:
        current_node = tree
        for i in range(len(word)):
            #Only the node for the final letter is marked as completing a valid word
            current_node = current_node.extend( word[i], i == len(word) - 1 )
    return tree

def checkDict( word, tree ):
    current_node = tree
    for letter in word:
        try:
            current_node = current_node.children[letter]
        except KeyError:
            return False #nowhere to go: no dictionary word continues this way

    return current_node.valid_subword #True only if we landed on a shaded node

Then, later on:

tree = generateTree( dictionary )
for word in allWords:
    for subword in subWords(word):
        checkDict( subword, tree )
        #Code to keep track of the number of words found, like you already have

This algorithm allows you to check whether or not a word is in your dictionary in O(m) time, where m is the length of the word being checked (at most the length of the longest word in the dictionary). Notice that this remains roughly constant however many words the dictionary contains. Your original algorithm was O(n) per check, where n is the number of words in the dictionary.
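As a quick sanity check of the three outcomes described above (the two-word dictionary here is just a toy example):

tree = generateTree( ['cat', 'cats'] )
print checkDict( 'cat', tree )   #True: we land on a shaded node
print checkDict( 'ca', tree )    #False: a valid path, but the node is not shaded
print checkDict( 'dog', tree )   #False: nowhere to go at 'd'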


1) Style and organization: it makes more sense to have a single function that generates all the subwords of a word.

2) Style: the double parentheses are not needed when passing a generator expression to set.

3) Performance (I hope): make a set from the words you're looking up, too; then you can use the built-in set intersection check.

4) Performance (almost certainly): don't write a manual loop to find the maximum element; use the built-in max. You can compare (length, elements) tuples directly; Python compares tuples element by element from first to last, just as it compares strings letter by letter (see the short example after this list).

5) Performance (maybe): Ensure there are no 1- or 2-letter words in the dictionary to begin with, since they just get in the way.

6) Performance (sadly true): don't break everything up into separate functions.

7) Style: File I/O should use a with block to ensure proper cleanup of resources, and file iterators iterate over lines by default, so we can get a list of lines implicitly instead of having to call .readlines().
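For point 4, a quick illustration of how tuple comparison drives max (toy values only):

>>> (3, ['cat', 'rat']) > (2, ['zebras'])
True
>>> max([(1, ['a']), (3, ['b']), (2, ['c'])])
(3, ['b'])

The first elements (the counts) decide the comparison; the second elements are only examined on a tie.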

I end up with (not properly tested, except the 'fragments' expression):

def countedSubWords(word, dictionary): 
  fragments = set(
    word[i:j]
    for i in range(len(word)) for j in range(i+3, len(word)+1)
  )
  subWords = fragments.intersection(dictionary)
  return (len(subWords), subWords)


def main():
  with open('wordlist.txt') as words:
    dictionary = set(word.strip() for word in words if len(word.strip()) > 2)
    print max(countedSubWords(word, dictionary) for word in dictionary)

main()


To explore basic Python, have a look at this function (basically a faster, polished, PEP8-happy version of what JBernardo and Karl Knechtel suggested):

def check_dict(word, dictionary): 
  """Return all subwords of `word` that are in `dictionary`."""
  fragments = set(word[i:j] 
                  for i in xrange(len(word) - 2) 
                  for j in xrange(i + 3, len(word) + 1))
  return fragments & dictionary

dictionary = frozenset(word for word in word_list if len(word) >= 3) # word_list: the words read from the word list file
print max(((word, check_dict(word, dictionary)) for word in dictionary), 
          key=lambda (word, subwords): len(subwords)) # max = the most subwords

Outputs something like:

('greatgrandmothers',
set(['and', 'rand', 'great', 'her', 'mothers', 'moth', 'mother', 'others', 'grandmothers', 'grandmother', 'ran', 'other', 'greatgrandmothers', 'greatgrandmother', 'grand', 'hers', 'the', 'eat']))

for the word list from http://www.mieliestronk.com/wordlist.html.


Now I know you're not going for performance (the code above is already <1s for the standard English vocab of 58k words).

But in case you'd need to run this super fast, say in some inner loop :)

  • You'd want to avoid creating copies of all the substrings inside check_dict on the heap; that's the main performance killer.
  • You could do that with pointer arithmetic, representing each substring by a pair of pointer delimiters only (as opposed to a full string object).
  • Use that substring to quickly determine whether it is a part of the valid vocabulary:
    • use the trie data structure, or its memory-friendly version PATRICIA tree
    • build the static trie once, from your dictionary, then do fast substring lookups
    • incrementally advance the pointers to explore all possible substrings, incrementing a hit counter for valid words (see the sketch after this list)
    • this way you avoid any dynamic allocations (no strings, no sets), blazing fast!!
  • All this is not very relevant in Python, because such memory management is too low-level and you'd be better off not using Python for performance-critical code anyway.
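For the curious, here is a rough pure-Python sketch of that trie idea, walking start/end indices instead of allocating substring slices (build_trie and count_subwords_trie are illustrative names; in Python this shows the structure more than the raw speed):

END = object()  #sentinel key marking "a dictionary word ends at this node"

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def count_subwords_trie(word, root):
    hits = 0
    for i in range(len(word)):            #each possible start index
        node = root
        for j in range(i, len(word)):     #extend one letter at a time, no slicing
            node = node.get(word[j])
            if node is None:
                break                     #no dictionary word continues this way
            if END in node and j - i >= 2:  #substring word[i:j+1] has length >= 3
                hits += 1
    return hits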


This runs in a few seconds. "sowpods.txt" has 267627 words of 3 or more letters. If you are using Python 2.5 or 2.6, you need to write at_least_3 = set(w for w in words if len(w)>=3) instead, since the set-comprehension syntax below requires Python 2.7.

words = open("sowpods.txt").read().split()

at_least_3 = {w for w in words if len(w)>=3}

def count_subwords(word):
    counter = 0
    for i in range(len(word)-2):
        for j in range(i+3,len(word)+1):
            candidate = word[i:j]
            if candidate in at_least_3:
                counter += 1
    return counter

for row in sorted((count_subwords(w),w) for w in at_least_3):
    print row

The highest number of subwords is 26

(26, 'CORESEARCHERS')
(26, 'FOREGONENESSES')
(26, 'METAGENETICALLY')
(26, 'PREPOSSESSIONS')
(26, 'SACRAMENTALISTS')
(26, 'WHOLESOMENESSES')


Is that what you're asking, or am I missing something?

>>> words = ['a', 'asd', 'asdf', 'bla']
>>> [sum(1 for i in (a for a in words if a in b)) for b in words]
[1, 2, 3, 2]

That's the number of subwords found in each word, counting the word itself. If you don't want to count words with fewer than 3 characters, just remove them from the list first...

And, of course, it's O(n²)

Edit:

The question asks for all the subwords, but the code in the question only reports the word with the most subwords... If you really want the first behavior, just remove the sum(...) part and turn the genexp into a list comprehension, as below...
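That is, with the same toy words list as above:

>>> [[a for a in words if a in b] for b in words]
[['a'], ['a', 'asd'], ['a', 'asd', 'asdf'], ['a', 'bla']]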
