How to optimize this Python code (from ThinkPython, Exercise 10.10)
I'm working through Allen Downey's How To Think Like A Computer Scientist, and I've written what I believe to be a functionally correct solution to Exercise 10.10. But it took just over 10 hours (!) to run, so I'm wondering if I'm missing some really obvious and helpful optimization.
Here's the Exercise:
"Two words 'interlock' if taking alternating letters from each forms a new word. For example, 'shoe' and 'cold' interlock to form 'schooled'. Write a program that finds all pairs of words that interlock. Hint: Don't enumerate all pairs!"
(For these word-list problems, Downey has supplied a file with 113809 words. We may assume that these words are in a list, one word per item in the list.)
Here's my solution:
from bisect import bisect_left
def index(lst, target):
    """If target is in lst, returns the index of target; otherwise returns None"""
    i = bisect_left(lst, target)
    if i != len(lst) and lst[i] == target:
        return i
    else:
        return None
def interlock(str1, str2):
    "Takes two strings of equal length and 'interlocks' them."
    if len(str1) == len(str2):
        lst1 = list(str1)
        lst2 = list(str2)
        result = []
        for i in range(len(lst1)):
            result.append(lst1[i])
            result.append(lst2[i])
        return ''.join(result)
    else:
        return None
def interlockings(word_lst):
    """Checks each pair of equal-length words to see if their interlocking is a word;
    prints each successful pair and the total number of successful pairs."""
    total = 0
    for i in range(1, 12):  # 12 because the longest word has 22 letters, so each half has at most 11
        # to shorten the loops, get a sublist of words of equal length
        sub_lst = filter(lambda x: len(x) == i, word_lst)
        for word1 in sub_lst[:-1]:
            for word2 in sub_lst[sub_lst.index(word1)+1:]:  # pair word1 only with words that come after word1
                word1word2 = interlock(word1, word2)  # interlock word1 with word2
                word2word1 = interlock(word2, word1)  # interlock word2 with word1
                if index(word_lst, word1word2) is not None:  # check to see if word1word2 is actually a word
                    total += 1
                    print "Word 1: %s, Word 2: %s, Interlock: %s" % (word1, word2, word1word2)
                if index(word_lst, word2word1) is not None:  # check to see if word2word1 is actually a word
                    total += 1
                    print "Word 2: %s, Word 1: %s, Interlock: %s" % (word2, word1, word2word1)
    print "Total interlockings: ", total
The print statements are not the problem; my program found only 652 such pairs. The problem is the nested loops, right? I mean, even though I'm looping over lists that contain only words of the same length, there are (for example) 21727 words of length 7, which means my program has to check over 400 million "interlockings" to see if they're actual words---and that's just for the length-7 words.
So again, this code took 10 hours to run (and found no pairs involving words of length 5 or greater, in case you were curious). Is there a better way to solve this problem?
Thanks in advance for any and all insight. I'm aware that "premature optimization is the root of all evil"---and perhaps I've fallen into that trap already---but in general, while I can usually write code that runs correctly, I often struggle with writing code that runs well.
Do it the other way around: Iterate through all words and split them into two words by taking the odd and even letters. Then look up those two words in the dictionary.
As a side note, the two words that interlock need not have the same length -- the lengths may also differ by 1.
Some (untested) code:
words = set(line.strip() for line in open("words"))

for w in words:
    even, odd = w[::2], w[1::2]
    if even in words and odd in words:
        print even, odd
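The slicing trick above can be sanity-checked against the exercise's own example (a standalone check, separate from the answer's code):

```python
# Splitting 'schooled' at even/odd positions recovers the original pair
word = 'schooled'
even, odd = word[::2], word[1::2]
assert (even, odd) == ('shoe', 'cold')

# An odd-length word splits into halves whose lengths differ by one,
# so pairs of unequal length are covered automatically
assert ('abcde'[::2], 'abcde'[1::2]) == ('ace', 'bd')
```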
Alternative definition for interlock:
import itertools

def interlock(str1, str2):
    "Takes two strings of equal length and 'interlocks' them."
    return ''.join(itertools.chain(*zip(str1, str2)))
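One caveat not stated above: zip truncates to the shorter input, so this version silently drops a letter when the lengths differ by 1. A sketch using zip_longest instead (zip_longest is the Python 3 name; Python 2 calls it izip_longest):

```python
from itertools import chain, zip_longest

def interlock(str1, str2):
    # zip_longest pads the shorter string with '', so the trailing
    # letter of str1 survives when len(str1) == len(str2) + 1
    return ''.join(chain.from_iterable(zip_longest(str1, str2, fillvalue='')))

assert interlock('shoe', 'cold') == 'schooled'
assert interlock('abc', 'xy') == 'axbyc'   # unequal lengths now work
```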
An alternate version:
with open('words.txt') as inf:
    words = set(wd.strip() for wd in inf)

word_gen = ((word, word[::2], word[1::2]) for word in words)
interlocked = [word for word, a, b in word_gen if a in words and b in words]
On my machine this runs in 0.16 seconds and returns 1254 words.
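If you want the interlocking pairs rather than just the combined words, the same comprehension can carry both halves along. A minimal sketch with a stand-in word set (the real code would build `words` from words.txt as above):

```python
# Stand-in for the full word list
words = {'shoe', 'cold', 'schooled', 'warm'}

# Keep the two halves alongside the interlocked word
pairs = [(w[::2], w[1::2], w) for w in words
         if w[::2] in words and w[1::2] in words]

assert pairs == [('shoe', 'cold', 'schooled')]
```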
Edit: as pointed out by @John Machin at "Why is this program faster in Python than Objective-C?", this can be further improved by lazy evaluation (only perform the second slice if the first slice yields a valid word):
with open('words.txt') as inf:
    words = set(wd.strip() for wd in inf)

interlocked = [word for word in words if word[::2] in words and word[1::2] in words]
This drops execution time by a third, to 0.104 seconds.
An important thing is your index function: it runs more often than any other function here. When you don't need the index of the found word, why define a function to find that index?

if word1word2 in word_lst:

is enough instead of

if index(word_lst, word1word2) is not None:

The same goes for the word2word1 check.
OK, bisection really is faster than the in operator on a list. To improve the speed a bit more, I suggest using the bisect_left function directly in your interlockings function.

For example, instead of:
if index(word_lst, word1word2) is not None:  # check to see if word1word2 is actually a word
    total += 1
    print "Word 1: %s, Word 2: %s, Interlock: %s" % (word1, word2, word1word2)
Use:
q = bisect_left(word_lst, word1word2)
if q != len(word_lst) and word_lst[q] == word1word2:
    total += 1
    print "Word 1: %s, Word 2: %s, Interlock: %s" % (word1, word2, word1word2)
A very slight improvement in speed.
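For comparison, a set avoids both the bisection and the linear scan entirely, since membership tests on a set are O(1) on average versus O(log n) for bisect on a sorted list. A minimal sketch (the three-word list is a stand-in for the real sorted word list):

```python
from bisect import bisect_left

lst = sorted(['cold', 'schooled', 'shoe'])  # stand-in for the sorted word list
word_set = set(lst)

def found_bisect(target):
    # O(log n) lookup in the sorted list, as in the answer above
    i = bisect_left(lst, target)
    return i != len(lst) and lst[i] == target

# Both lookups agree; the set version needs no sorting at all
assert found_bisect('schooled') == ('schooled' in word_set) == True
assert found_bisect('warm') == ('warm' in word_set) == False
```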