
Python memory leak while populating a list - how to fix it?

I have a piece of code that looks like this:

downloadsByExtensionCount = defaultdict(int)
downloadsByExtensionList = []
logFiles = ['file1.log', 'file2.log', 'file3.log', 'file4.log']


for logFile in logFiles:
    log = open(logFile, 'r', encoding='utf-8')
    logLines = log.readlines()

    for logLine in logLines:
        date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent = logLine.split(" ")

        downloadsByExtensionCount[cs_uri_stem] += 1
        downloadsByExtensionList.append([date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent])

Each of these four files is around 150MB and has around 60,000 to 80,000 lines in it.

I started writing the script using only one of these files because it was faster to test the functionality that way, but now that I have all the logic and functionality in place, I of course tried running it on all four log files at once. What I get when the script starts fetching data from the fourth file is this:

Traceback (most recent call last):
  File "C:\Python32\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
MemoryError

So - I took a look at how much memory this thing is consuming, and this is what I found:

The script reads the first three files and gets to somewhere around 1800-1950MB; then it starts reading the last file, goes up by 50-100MB more, and then I get the error. I tried running the script with the last line (the append) commented out, and then it gets up to around 500MB total.

So, what am I doing wrong? These four files combined are around 600MB, yet the script consumes around 1500MB populating the list from only three of the four files, and I don't really understand why. How can I improve this? Thank you.


log.readlines() reads the file contents into a list of lines. You can iterate over the file directly to avoid that extra list.

from collections import defaultdict

downloadsByExtensionCount = defaultdict(int)
downloadsByExtensionList = []
logFiles = ['file1.log', 'file2.log', 'file3.log', 'file4.log']


for logFile in logFiles:
    # closes the file after the block
    with open(logFile, 'r', encoding='utf-8') as log:
        # just iterate over the file
        for logLine in log:
            date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent = logLine.split(" ")
            downloadsByExtensionCount[cs_uri_stem] += 1
            # tuples are enough to store the data
            downloadsByExtensionList.append((date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent))
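
If the full list is not actually needed later, the same line-by-line iteration can be wrapped in a generator so that only one record is alive at a time. This is a minimal sketch, not part of the original answer: the helper name `iter_log_records`, the sample log lines, and the choice to strip the trailing newline and skip lines that do not have exactly seven fields are all assumptions made for illustration.

```python
import os
import tempfile
from collections import defaultdict


def iter_log_records(paths, encoding='utf-8'):
    """Yield one 7-field tuple per log line without keeping the lines around.

    Hypothetical helper: lines without exactly seven space-separated
    fields are skipped (an assumption, the original code would raise).
    """
    for path in paths:
        with open(path, 'r', encoding=encoding) as log:
            for line in log:
                fields = line.rstrip('\n').split(' ')
                if len(fields) != 7:
                    continue
                yield tuple(fields)


# Small demo with a made-up log file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.log')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('2011-01-01 12:00:00 10.0.0.1 GET 80 /a.zip Mozilla/5.0\n')
        f.write('2011-01-01 12:00:01 10.0.0.2 GET 80 /a.zip Mozilla/5.0\n')
        f.write('2011-01-01 12:00:02 10.0.0.3 GET 80 /b.exe Mozilla/5.0\n')

    counts = defaultdict(int)
    for record in iter_log_records([path]):
        # index 5 is cs_uri_stem in the field order used above
        counts[record[5]] += 1

print(dict(counts))
```

With this shape, memory use depends on the number of distinct keys in `counts` rather than on the total number of log lines. If every record must be kept, the generator does not help by itself and the list will still dominate.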


Iterate directly through the file content:

for logFile in logFiles:

    log = open(logFile, 'r', encoding='utf-8')
    for logLine in log:
        ...
    log.close()

Use tuple instead of list:

>>> import sys
>>> sys.getsizeof(('1','2','3'))
80
>>> sys.getsizeof(['1','2','3'])
96
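
The container is only part of the cost: every field is its own string object with a fixed overhead, which is likely why the list takes far more memory than the files take on disk. The sketch below estimates the footprint of a single record; the helper name `record_footprint` and the sample line are made up for illustration, and the exact numbers depend on the Python version and platform. As far as I know, Python 3.2 builds on Windows also stored every character in two bytes, which alone would roughly double ASCII log text in memory, while Python 3.3 and later store ASCII-only strings more compactly.

```python
import sys


def record_footprint(record):
    """Approximate bytes used by a container of strings:
    the container itself plus each string it references."""
    return sys.getsizeof(record) + sum(sys.getsizeof(field) for field in record)


# Made-up log line with the same seven fields as in the question.
line = '2011-01-01 12:00:00 10.0.0.1 GET 80 /files/setup.exe Mozilla/5.0'
fields = line.split(' ')

as_list = list(fields)
as_tuple = tuple(fields)

print('bytes on disk:  ', len(line.encode('utf-8')))
print('list record:    ', record_footprint(as_list))
print('tuple record:   ', record_footprint(as_tuple))
```

Switching to tuples saves a little per record, but the per-string overhead is the same either way, so the saving is small compared to the total.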


You can use the built-in sqlite3 module for data manipulation. You can also supply the special name ":memory:" instead of "c:/temp/example" to create the database in RAM. If the database is stored on disk rather than in RAM, the limit is the free space on the hard disk.

import sqlite3
from collections import defaultdict

downloadsByExtensionCount = defaultdict(int)
# downloadsByExtensionList = []
logFiles = ['file1.log', 'file2.log', 'file3.log', 'file4.log']


conn = sqlite3.connect('c:/temp/example')
c = conn.cursor()
# Create table
c.execute('create table if not exists logs(date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent)')

for logFile in logFiles:
    try:
        log = open(logFile, 'r', encoding='utf-8')
    except IOError:
        # skip files that cannot be opened
        continue

    # 'with' closes the file; iterating avoids loading every line via readlines()
    with log:
        for logLine in log:
            date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent = logLine.split(" ")

            downloadsByExtensionCount[cs_uri_stem] += 1
            c.execute(
                'insert into logs(date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent) values(?,?,?,?,?,?,?)',
                (date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent)
            )

conn.commit()
conn.close()
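
Once the rows are in SQLite, the per-URI counts can also be computed by the database instead of a defaultdict. The following is a rough sketch rather than part of the answer above: it uses an in-memory database and three made-up rows so that it runs on its own, and it assumes the same `logs` table layout.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(':memory:')
conn.execute(
    'create table logs(date, time, c_ip, cs_method, s_port, cs_uri_stem, cs_user_agent)'
)

# Made-up sample rows standing in for parsed log lines.
sample_rows = [
    ('2011-01-01', '12:00:00', '10.0.0.1', 'GET', '80', '/a.zip', 'Mozilla/5.0'),
    ('2011-01-01', '12:00:01', '10.0.0.2', 'GET', '80', '/a.zip', 'Mozilla/5.0'),
    ('2011-01-01', '12:00:02', '10.0.0.3', 'GET', '80', '/b.exe', 'Mozilla/5.0'),
]
conn.executemany('insert into logs values (?,?,?,?,?,?,?)', sample_rows)
conn.commit()

# Let the database do the counting; each result row is (cs_uri_stem, count).
counts = dict(
    conn.execute('select cs_uri_stem, count(*) from logs group by cs_uri_stem')
)
conn.close()

print(counts)
```

For the real log files, `executemany` can be fed from a generator over the lines, so the rows never have to sit in a Python list.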