Python - need a fast algorithm that removes all words in a file that are derivatives of other words
We have a file named wordlist.txt, which contains 1,876 KB of alphabetized words, all of which are longer than 4 letters, with a carriage return between every pair of words and between each new two-letter prefix group (ab, ac, ad, etc.):
wfile = open("wordlist.txt", "r+")
I want to create a new file that contains only words that are not derivatives of other, smaller words. For example, the wordlist contains the words "abuser", "abused", "abusers", "abuse", "abuses", etc. The new file should retain only the word "abuse" because it is the "lowest common denominator" (if you will) among all those words. Similarly, the word "rodeo" would be removed because it contains the word "rode".
I tried this implementation:
def root_words(wordlist):
    result = []
    base = wordlist[1]
    for word in wordlist:
        if not word.startswith(base):
            result.append(base)
            print base
            base = word
    result.append(base)
    return result
def main():
    wordlist = []
    wfile = open("wordlist.txt", "r+")
    for line in wfile:
        wordlist.append(line[:-1])
    wordlist = root_words(wordlist)
    newfile = open("newwordlist.txt", "r+")
    newfile.write(wordlist)
But it always froze my computer. Any solutions?
I would do something like this:
def bases(words):
    base = next(words)
    yield base
    for word in words:
        if word and not word.startswith(base):
            yield word
            base = word

def get_bases(infile, outfile):
    with open(infile) as f_in:
        words = (line.strip() for line in f_in)
        with open(outfile, 'w') as f_out:
            f_out.writelines(word + '\n' for word in bases(words))
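For reference, the timing below comes from running this as a script with a tiny driver along these lines (the filenames are just the ones from the question; get_bases itself doesn't care):

if __name__ == '__main__':
    # Hypothetical driver, using the filenames from the question.
    get_bases('wordlist.txt', 'newwordlist.txt')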
This goes through the corncob list of 58,000 words in a fifth of a second on my fairly old laptop. It's old enough to have one gig of memory.
$ time python words.py
real 0m0.233s
user 0m0.180s
sys 0m0.012s
It uses iterators everywhere it can to go easy on the memory. You could probably increase performance by slicing off the end of the lines instead of using strip to get rid of the newlines.
Also note that this relies on your input being sorted and non-empty. That was part of the stated preconditions though so I don't feel too bad about it ;)
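If you want to try the slicing idea, here is a sketch of the same get_bases with strip() swapped out (the endswith check is only there because the very last line of the file might lack a newline):

def get_bases_sliced(infile, outfile):
    # Same as get_bases above, but slices the trailing '\n' off each line
    # instead of calling strip(). The endswith() guard covers a final line
    # that has no newline.
    with open(infile) as f_in:
        words = (line[:-1] if line.endswith('\n') else line for line in f_in)
        with open(outfile, 'w') as f_out:
            f_out.writelines(word + '\n' for word in bases(words))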
One possible improvement is to use a database to load the words and avoid loading the full input file in RAM. Another option is to process the words as you read them from the file and write the results without loading everything in memory.
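As a rough sketch of the database route (none of these names come from the original post, and the query is only one way to express "no shorter word is a prefix of this one"), something like sqlite3 could be used:

import sqlite3

def bases_via_db(infile, outfile):
    # Load the words into an on-disk SQLite database so the full list
    # never has to sit in a Python list in RAM.
    con = sqlite3.connect("words.db")
    con.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)")
    with open(infile) as f:
        con.executemany("INSERT OR IGNORE INTO words VALUES (?)",
                        ((w,) for w in (line.strip() for line in f) if w))
    con.commit()
    # Keep only words that no other word in the table is a prefix of.
    rows = con.execute("""
        SELECT word FROM words w1
        WHERE NOT EXISTS (
            SELECT 1 FROM words w2
            WHERE w2.word != w1.word AND w1.word LIKE w2.word || '%'
        )
        ORDER BY word""")
    with open(outfile, "w") as out:
        for (word,) in rows:
            out.write(word + "\n")
    con.close()

Note that this trades the single linear pass for a quadratic self-join inside SQLite, so it only pays off if memory really is the constraint.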
The following example takes the second route: it processes the file as it is read, without pre-loading anything into memory.
def root_words(f, out):
    # Strip the trailing newlines so the prefix comparison works.
    base = f.readline().strip()
    for line in f:
        word = line.strip()
        if not word.startswith(base):
            out.write(base + "\n")
            base = word
    out.write(base + "\n")
def main():
    wfile = open("wordlist.txt", "r+")
    newfile = open("newwordlist.txt", "w")
    root_words(wfile, newfile)
    wfile.close()
    newfile.close()
The memory complexity of this solution is O(1), since the variable base is the only state you need to keep while processing the file. This works because the file is alphabetically sorted.
since the list is alphabetized, this does the trick (it takes 0.4 seconds with 5 MB of data, so it should not be a problem with 1.8 MB)
res = [" "]  # sentinel: no word starts with a space, so the first word is always kept
with open("wordlist.txt", "r") as f:
    for line in f:
        tmp = line.strip()
        if tmp.startswith(res[-1]):
            pass
        else:
            res.append(tmp)

with open("newlist.txt", "w") as f:
    f.write('\n'.join(res[1:]))  # drop the sentinel