
Database Compression in Python

I have hourly logs like

user1:joined
user2:log out
user1:added pic
user1:added comment
user3:joined

I want to compress all the flat files down to one file. There are around 30 million users in the logs, and I just want the latest log entry for each user.

As an end result, I want the log to look like

user1:added comment
user2:log out
user3:joined

Now my first attempt on a small scale was to just do a dict like

log['user1'] = "added comment"

Will a dict of 30 million key/value pairs have a giant memory footprint? Or should I use something like SQLite to store them, and then just dump the contents of the SQLite table back into a file?


If you intern() each log entry then you'll use only one string object for each distinct log entry, regardless of the number of times it shows up, which lowers memory usage a lot.

>>> a = 'foo'
>>> b = 'foo'
>>> a is b
True
>>> b = 'f' + ('oo',)[0]
>>> a is b
False
>>> a = intern('foo')
>>> b = intern('f' + ('oo',)[0])
>>> a is b
True
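
Applied to this problem, that could look something like the sketch below (the file name is made up; on Python 3 the function lives at sys.intern):

latest = {}
for line in open('hourly.log'):          # hypothetical log file
    user, op = line.rstrip('\n').split(':', 1)
    latest[user] = intern(op)            # sys.intern(op) on Python 3

Since there are only a handful of distinct actions, all 30 million values end up pointing at the same few string objects.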


You could also process the log lines in reverse -- then use a set to keep track of which users you've seen:

s = set()

# note, this piece is inefficient in that I'm reading all the lines
# into memory in order to reverse them...  There are recipes out there
# for reading a file in reverse.
lines = open('log').readlines()
lines.reverse()

for line in lines:
    line = line.strip()
    user, op = line.split(':')
    if user not in s:
        print line
        s.add(user)
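
One such recipe, sketched here on the assumption that the records are plain newline-separated lines, reads the file in blocks from the end so the whole thing never has to sit in memory:

def reverse_lines(path, blocksize=8192):
    # Yield the lines of a file from last to first, reading block-by-block
    # from the end instead of slurping the whole file.
    f = open(path, 'rb')
    f.seek(0, 2)                    # jump to the end of the file
    pos = f.tell()
    buf = ''
    while pos > 0:
        step = min(blocksize, pos)
        pos -= step
        f.seek(pos)
        buf = f.read(step) + buf
        parts = buf.split('\n')
        buf = parts[0]              # possibly incomplete first line; keep for the next block
        for part in reversed(parts[1:]):
            if part:
                yield part
    if buf:
        yield buf
    f.close()

The loop above could then iterate over reverse_lines('log') directly instead of reversing a list of every line in memory.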


The various dbm modules (dbm in Python 3, or anydbm, gdbm, dbhash, etc. in Python 2) let you create simple databases of key-to-value mappings. They are stored on disk, so there is no huge memory impact. And you can write them back out as log files if you wish.
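
A minimal sketch of that route (the file names are made up; on Python 3 you would import dbm instead of anydbm):

import anydbm                                   # plain `dbm` in Python 3

db = anydbm.open('latest_actions.db', 'c')      # lives on disk, not in RAM
for logname in ('log20101214', 'log20101215'):  # hypothetical hourly files
    for line in open(logname):
        user, op = line.rstrip('\n').split(':', 1)
        db[user] = op                           # later entries overwrite earlier ones
db.close()

# Dump the compressed result back out as a flat file.
db = anydbm.open('latest_actions.db', 'r')
out = open('cumulativeLog', 'w')
for user in sorted(db.keys()):                  # sorting holds only the keys in memory
    out.write('%s:%s\n' % (user, db[user]))
out.close()
db.close()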


This sounds like the perfect kind of problem for a Map/Reduce solution. See:

  • http://en.wikipedia.org/wiki/MapReduce
  • Hadoop

for example.
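
As a rough illustration of the shape such a job would take (this is only a toy, single-machine sketch, not Hadoop API code): the map step emits each record keyed by user together with its position, and the reduce step keeps the most recent record per user.

def map_phase(file_index, path):
    # Emit (user, ((file_index, line_no), action)); the position tuple lets
    # the reducer tell which record is the most recent.
    for line_no, line in enumerate(open(path)):
        user, op = line.rstrip('\n').split(':', 1)
        yield user, ((file_index, line_no), op)

def reduce_phase(user, values):
    # Keep only the latest action seen for this user.
    _, latest_op = max(values)
    return '%s:%s' % (user, latest_op)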


It's pretty easy to mock up the data structure to see how much memory it would take.

Something like this, where you could change gen_string to generate data that approximates the real messages:

import random
from commands import getstatusoutput as gso

def gen_string():
    return str(random.random())

d = {}
for z in range(10**6):
    d[gen_string()] = gen_string()

print gso('ps -eo %mem,cmd |grep test.py')[1]

On a one gig netbook:

  0.4 vim test.py
  0.1 /bin/bash -c time python test.py
 11.7 /usr/bin/python2.6 test.py
  0.1 sh -c { ps -eo %mem,cmd |grep test.py; } 2>&1
  0.0 grep test.py

   real    0m26.325s
   user    0m25.945s
   sys     0m0.377s

... So it's using about 10% of 1 GB for 1,000,000 records

But it would also depend on how much data redundancy you have ...


Thanks to @Ignacio for intern() -

def procLog(logName, userDict):
    # Later files overwrite earlier entries, so pass the logs oldest-first.
    inf = open(logName, 'r')
    for ln in inf:
        name, act = ln.split(':', 1)
        userDict[name] = intern(act)   # act keeps its trailing newline
    inf.close()
    return userDict

def doLogs(logNameList):
    userDict = {}
    for logName in logNameList:
        userDict = procLog(logName, userDict)
    return userDict

def writeOrderedLog(logName, userDict):
    keylist = userDict.keys()
    keylist.sort()
    outf = open(logName, 'w')
    for k in keylist:
        outf.write(k + ':' + userDict[k])   # the newline is still attached to the value
    outf.close()

def main():
    mylogs = ['log20101214', 'log20101215', 'log20101216']
    d = doLogs(mylogs)
    writeOrderedLog('cumulativeLog', d)

if __name__ == '__main__':
    main()

The question, then, is how much memory this will consume. To estimate it, I generated random user names and actions and tracked the dict's size as it grew:

import random
import sys

def makeUserName():
    ch = random.choice
    syl = ['ba','ma','ta','pre','re','cu','pro','do','tru','ho','cre','su','si','du','so','tri','be','hy','cy','ny','quo','po']
    # 22**5 is about 5.1 million potential names
    return ch(syl).title() + ch(syl) + ch(syl) + ch(syl) + ch(syl)

ch = random.choice
states = ['joined', 'added pic', 'added article', 'added comment', 'voted', 'logged out']
d = {}
t = []   # (number of entries so far, size of the dict object in bytes)
for i in xrange(1000):
    for j in xrange(8000):
        d[makeUserName()] = ch(states)
    t.append( (len(d), sys.getsizeof(d)) )

which results in

[plot of the (number of user names, dict size) measurements collected above]

(horizontal axis = number of user names, vertical axis = memory usage in bytes) which is... slightly weird. It looks like the dictionary preallocates quite a lot of memory, then roughly doubles it every time it gets too full (CPython resizes a dict once it is about two-thirds full).

Anyway, 4 million users take just under 100 MB of RAM for the dict itself, but that allocation actually happens at around 3 million users, when usage jumps from about 50 MB. So if the doubling holds, you will need about 800 MB of RAM to process 24 to 48 million users. Note that sys.getsizeof only counts the dict's own table, not the key and value strings, so the real total will be somewhat higher.
