
Python: String counting memory error

Based on the suggestions I received in this forum, I am using the following code (example) to count strings.

phrase_words = ['red car', 'no lake', 'newjersey turnpike']
lines = ['i have a red car which i drove on newjersey', 'turnpike. when i took exit 39 there was no', 'lake. i drove my car on muddy roads which turned my red', 'car into brown. driving on newjersey turnpike can be confusing.']
text = " ".join(lines)
dict = {phrase: text.count(phrase) for phrase in phrase_words}

The desired output, which the example code produces, is:

{'newjersey turnpike': 2, 'red car': 2, 'no lake': 1}

This code worked great on a text file that was less than 300 MB. When I used a text file larger than 500 MB, I received the following memory error:

    y=' '.join(lines)
MemoryError

How do I overcome this? Thanks for your help!


This algorithm needs only two lines in memory at a time. It assumes that no phrase will span three lines:

from itertools import tee
from collections import defaultdict

def pairwise(iterable):  # recipe from the itertools docs
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)  # use itertools.izip on Python 2

d = defaultdict(int)
phrase_words = ['red car', 'no lake', 'newjersey turnpike']
lines = ['i have a red car which i drove on newjersey',
         'turnpike. when i took exit 39 there was no',
         'lake. i drove my car on muddy roads which turned my red',
         'car into brown. driving on newjersey turnpike can be confusing.']

for line1, line2 in pairwise(lines):
    both_lines = ' '.join((line1, line2))
    for phrase in phrase_words:
        # count phrases in line1, including those that span into line2
        d[phrase] += both_lines.count(phrase) - line2.count(phrase)

for phrase in phrase_words:
    d[phrase] += line2.count(phrase)  # otherwise the last line is never counted
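
The example above runs on the small in-memory list from the question. For a 500 MB file you would not build that list at all; the same two-lines-at-a-time idea can be applied straight to the file object. A minimal sketch, assuming the text sits in a file named big.txt (the filename is just a placeholder):

from collections import defaultdict

phrase_words = ['red car', 'no lake', 'newjersey turnpike']
d = defaultdict(int)

with open('big.txt') as f:                  # 'big.txt' is a placeholder filename
    prev = None
    for line in f:
        line = line.rstrip('\n')
        if prev is not None:
            both_lines = prev + ' ' + line
            for phrase in phrase_words:
                # phrases inside prev, plus those spanning from prev into line
                d[phrase] += both_lines.count(phrase) - line.count(phrase)
        prev = line
    if prev is not None:
        for phrase in phrase_words:
            d[phrase] += prev.count(phrase)  # the last line still needs counting

print(dict(d))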


Don't try to do everything at once. Instead of loading the huge file into memory and then parsing it, load the file a few (hundred) lines at a time and look for your strings within those lines.

As long as your chunks overlap at least max(len(phrase) for phrase in phrase_words) characters, you won't miss any hits.

So you could do something like:

occurs = {phrase: 0 for phrase in phrase_words}
overlap_size = max(len(phrase) for phrase in phrase_words)
text = ' ' * overlap_size   # dummy tail so the first chunk is handled like the others

while lines:
    # append the next chunk, joined the same way the full text would have been
    text += ' ' + ' '.join(lines[:1000])
    lines = lines[1000:]
    for phrase in phrase_words:
        # skip matches that lie entirely inside the previous tail, so nothing is double counted
        occurs[phrase] += text[overlap_size - len(phrase) + 1:].count(phrase)
    # keep only the last overlap_size characters as the overlap for the next chunk
    text = text[-overlap_size:]
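
This still assumes lines is already a list in memory, which is exactly what fails for a 500 MB file. To really read a few hundred lines at a time, you can pull the chunks lazily from the file with itertools.islice and keep the same overlap logic. A minimal sketch, again assuming a placeholder filename big.txt:

from itertools import islice

phrase_words = ['red car', 'no lake', 'newjersey turnpike']
occurs = {phrase: 0 for phrase in phrase_words}
overlap_size = max(len(phrase) for phrase in phrase_words)

with open('big.txt') as f:                          # placeholder filename
    stripped = (line.rstrip('\n') for line in f)    # lazy line iterator
    text = ' ' * overlap_size                       # dummy tail for the first chunk
    while True:
        chunk = list(islice(stripped, 1000))        # at most 1000 lines in memory
        if not chunk:
            break
        text += ' ' + ' '.join(chunk)
        for phrase in phrase_words:
            # skip matches that were already counted in the previous tail
            occurs[phrase] += text[overlap_size - len(phrase) + 1:].count(phrase)
        text = text[-overlap_size:]                 # keep only the overlap

print(occurs)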
