Saving huge bigram dictionary to file using pickle

2022-12-18 04:37 问答作者：

a friend of mine wrote this little progam. the textFile is 1.2GB in size (7 years worth of newspapers). He successfully manages to create the dictionary but he cannot write it to a file using pickle(program hangs).

import sys
import string
import cPickle as pickle

biGramDict = {}

textFile = open(str(sys.argv[1]), 'r')
biGramDictFile = open(str(sys.argv[2]), 'w')


for line in textFile:
   if (line.find('<s>')!=-1):
      old = None
      for line2 in textFile:
         if (line2.find('</s>')!=-1):
            break
         else:
            line2=line2.strip()
            if line2 not in string.punctuation:
               if old != None:
                  if old not in biGramDict:
                     biGramDict[old] = {}
                  if line2 not in biGramDict[old]:
                     biGramDict[old][line2] = 0
                  biGramDict[old][line2]+=1
               old=line2

textFile.close()

print "going to pickle..."    
pickle.dump(biGramDict, biGramDictFile,2)

print "pickle done. now load it..."

biGramDictFile.close()
biGramDictFile = open(str(sys.argv[2]), 'r')

newBiGramDict = pickle.load(biGramDictFile)

thanks in advance.

EDIT

for anyone interested i will briefly explain what this program does. assuming you have a file formated roughly like this:

开发者_StackOverflow

<s>
Hello
,
World
!
</s>
<s>
Hello
,
munde
!
</s>
<s>
World
domination
.
</s>
<s>
Total
World
domination
!
</s>

<s> are sentences separators.
one word per line.

a biGramDictionary is generated for later use.

something like this:

{
 "Hello": {"World": 1, "munde": 1}, 
 "World": {"domination": 2},
 "Total": {"World": 1},
}

hope this helps. right now the strategy changed to using mysql because sqlite just wasn't working (probably because of the size)

Pickle is only meant to write complete (small) objects. Your dictionary is a bit large to even hold in memory, you'd better use a database instead so you can store and retrieve entries one by one instead of all at once.

Some good and easily integratable singe-file database formats you can use from Python are SQLite or one of the DBM variants. The last one acts just like a dictionary (i.e. you can read and write key/value-pairs) but uses the disk as storage rather than 1.2 GBs of memory.

Do you really need the whole data in memory? You could split it in naive ways like one file for each year o each month if you want the dictionary/pickle approach.

Also, remember that the dictionaries are not sorted, you can have problems having to sort that ammount of data. In case you want to search or sort the data, of course...

Anyway, I think that the database approach commented before is the most flexible one, specially on the long run...

One solution is to use buzhug instead of pickle. It's a pure Python solution, and retains very Pythonic syntax. I think of it as the next step up from shelve and their ilk. It will handle the data sizes you're talking about. Its size limit is 2 GB per field (each field is stored in a separate file).

If your really, really want to use a dictionary like semantics, try SQLAlchemy's associationproxy. The following (rather long) piece of code translates your dictionary into Key,Value-Pairs in the entries-Table. I do not know how SQLAlchemy copes with your big dictionary, but SQLite should be able to handle it nicely.

from sqlalchemy import create_engine, MetaData
from sqlalchemy import Table, Column, Integer, ForeignKey, Unicode, UnicodeText
from sqlalchemy.orm import mapper, sessionmaker, scoped_session, Query, relation
from sqlalchemy.orm.collections import column_mapped_collection
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.schema import UniqueConstraint

engine = create_engine('sqlite:///newspapers.db')

metadata = MetaData()
metadata.bind = engine

Session = scoped_session(sessionmaker(engine))
session = Session()

newspapers = Table('newspapers', metadata,
    Column('newspaper_id', Integer, primary_key=True),
    Column('newspaper_name', Unicode(128)),
)

entries = Table('entries', metadata,
    Column('entry_id', Integer, primary_key=True),
    Column('newspaper_id', Integer, ForeignKey('newspapers.newspaper_id')),
    Column('entry_key', Unicode(255)),
    Column('entry_value', UnicodeText),
    UniqueConstraint('entry_key', 'entry_value', name="pair"),
)

class Base(object):

    def __init__(self, **kw):
        for key, value in kw.items():
            setattr(self, key, value)

    query = Session.query_property(Query)

def create_entry(key, value):
    return Entry(entry_key=key, entry_value=value)

class Newspaper(Base):

    entries = association_proxy('entry_dict', 'entry_value',
        creator=create_entry)

class Entry(Base):
    pass

mapper(Newspaper, newspapers, properties={
    'entry_dict': relation(Entry,
        collection_class=column_mapped_collection(entries.c.entry_key)),
})
mapper(Entry, entries)

metadata.create_all()

dictionary = {
    u'foo': u'bar',
    u'baz': u'quux'
}

roll = Newspaper(newspaper_name=u"The Toilet Roll")
session.add(roll)
session.flush()

roll.entries = dictionary
session.flush()

for entry in Entry.query.all():
    print entry.entry_key, entry.entry_value
session.commit()

session.expire_all()

print Newspaper.query.filter_by(newspaper_id=1).one().entries

gives

foo bar
baz quux
{u'foo': u'bar', u'baz': u'quux'}

I captured images from http://coverartarchive.org and although slow downloading so many images, pickle had no problem with 155 MB:

$ ll
total 151756
-rw-rw-r--  1 rick rick 155208082 Oct 10 10:04 ipc.pickle

As I move beyond downloading images for just one CD, I'll come back and update this answer with larger pickle limits. Unfortunately I haven't found anywhere that states the pickling limits...

继续阅读：file python

Saving huge bigram dictionary to file using pickle

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？