Python - need a fast algorithm that removes all words in a file that are derivatives of other words
We have a file named wordlist.txt, which contains 1,876 KB of alphabetized words, all of which are longer than 4 letters, with a carriage return between every pair of words and between each new two-letter prefix group (ab, ac, ad, etc.):
wfile = open("wordlist.txt", "r+")
I want to create a new file that contains only words that are not derivatives of other, smaller words. For example, the wordlist contains the words "abuser", "abused", "abusers", "abuse", "abuses", etc. The new file should retain only the word "abuse" because it is the "lowest common denominator" (if you will) among all those words. Similarly, the word "rodeo" would be removed because it contains the word "rode".
I tried this implementation:
def root_words(wordlist):
    result = []
    base = wordlist[1]
    for word in wordlist:
        if not word.startswith(base):
            result.append(base)
            print base
            base = word
    result.append(base)
    return result
def main():
    wordlist = []
    wfile = open("wordlist.txt", "r+")
    for line in wfile:
        wordlist.append(line[:-1])
    wordlist = root_words(wordlist)
    newfile = open("newwordlist.txt", "r+")
    newfile.write(wordlist)
But it always froze my computer. Any solutions?
I would do something like this:
def bases(words):
    base = next(words)
    yield base
    for word in words:
        if word and not word.startswith(base):
            yield word
            base = word

def get_bases(infile, outfile):
    with open(infile) as f_in:
        words = (line.strip() for line in f_in)
        with open(outfile, 'w') as f_out:
            f_out.writelines(word + '\n' for word in bases(words))
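For reference, the timing below comes from running this as a script with a tiny driver along these lines (the filenames are just the ones from the question; get_bases itself doesn't care):

if __name__ == '__main__':
    # Hypothetical driver, using the filenames from the question.
    get_bases('wordlist.txt', 'newwordlist.txt')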
This goes through the corncob list of 58,000 words in a fifth of a second on my fairly old laptop. It's old enough to have one gig of memory.
$ time python words.py
real 0m0.233s
user 0m0.180s
sys 0m0.012s
It uses iterators everywhere it can to go easy on the memory. You could probably increase performance by slicing off the end of the lines instead of using strip to get rid of the newlines.
Also note that this relies on your input being sorted and non-empty. That was part of the stated preconditions though so I don't feel too bad about it ;)
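If you want to try the slicing idea, here is a sketch of the same get_bases with strip() swapped out (the endswith check is only there because the very last line of the file might lack a newline):

def get_bases_sliced(infile, outfile):
    # Same as get_bases above, but slices the trailing '\n' off each line
    # instead of calling strip(). The endswith() guard covers a final line
    # that has no newline.
    with open(infile) as f_in:
        words = (line[:-1] if line.endswith('\n') else line for line in f_in)
        with open(outfile, 'w') as f_out:
            f_out.writelines(word + '\n' for word in bases(words))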
One possible improvement is to use a database to load the words and avoid loading the full input file in RAM. Another option is to process the words as you read them from the file and write the results without loading everything in memory.
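As a rough sketch of the database route (none of these names come from the original post, and the query is only one way to express "no shorter word is a prefix of this one"), something like sqlite3 could be used:

import sqlite3

def bases_via_db(infile, outfile):
    # Load the words into an on-disk SQLite database so the full list
    # never has to sit in a Python list in RAM.
    con = sqlite3.connect("words.db")
    con.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)")
    with open(infile) as f:
        con.executemany("INSERT OR IGNORE INTO words VALUES (?)",
                        ((w,) for w in (line.strip() for line in f) if w))
    con.commit()
    # Keep only words that no other word in the table is a prefix of.
    rows = con.execute("""
        SELECT word FROM words w1
        WHERE NOT EXISTS (
            SELECT 1 FROM words w2
            WHERE w2.word != w1.word AND w1.word LIKE w2.word || '%'
        )
        ORDER BY word""")
    with open(outfile, "w") as out:
        for (word,) in rows:
            out.write(word + "\n")
    con.close()

Note that this trades the single linear pass for a quadratic self-join inside SQLite, so it only pays off if memory really is the constraint.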
The following example takes the second route: it processes the file as it is read, without pre-loading anything into memory.
def root_words(f, out):
    # Strip the trailing newlines so the prefix comparison works.
    base = f.readline().strip()
    for line in f:
        word = line.strip()
        if not word.startswith(base):
            out.write(base + "\n")
            base = word
    out.write(base + "\n")
def main():
    wfile = open("wordlist.txt", "r+")
    newfile = open("newwordlist.txt", "w")
    root_words(wfile, newfile)
    wfile.close()
    newfile.close()
The memory complexity of this solution is O(1), since the variable base is the only state you need to keep while processing the file. This works because the file is alphabetically sorted.
since the list is alphabetized, this does the trick (it takes 0.4 seconds with 5 MB of data, so it should not be a problem with 1.8 MB)
res = [" "]  # sentinel: no word starts with a space, so the first word is always kept
with open("wordlist.txt", "r") as f:
    for line in f:
        tmp = line.strip()
        if tmp.startswith(res[-1]):
            pass
        else:
            res.append(tmp)

with open("newlist.txt", "w") as f:
    f.write('\n'.join(res[1:]))  # drop the sentinel