开发者

Efficient data migration on a large django table

I need to add a new column to a large (5m row) django table. I have a south schem开发者_开发问答amigration that creates the new column. Now I'm writing a datamigration script to populate the new column. It looks like this. (If you're not familiar with south migrations, just ignore the orm. prefixing the model name.)

print "Migrating %s articles." % orm.Article.objects.count()
cnt = 0
for article in orm.Article.objects.iterator():            
    if cnt % 500 == 0:
        print "    %s done so far" % cnt
    # article.newfield = calculate_newfield(article)
    article.save()
    cnt += 1

I switched from objects.all to objects.iterator to reduce memory requirements. But something is still chewing up vast memory when I run this script. Even with the actually useful line commented out as above, the script still grows to using 10+ GB of ram before getting very far through the table and I give up on it.

Seems like something is holding on to these objects in memory. How can I run this so it's not a memory hog?

FWIW, I'm using python 2.6, django 1.2.1, south 0.7.2, mysql 5.1.


Ensure settings.DEBUG is set to False. DEBUG=True fills memory especially with database intensive operations, since it stores all queries sent to the RDBMS within a view.

With Django 1.8 out, it should not be necessary since a hardcoded max of 9000 queries are now stored, instead of an infinite number before.


Welcome to Django's ORM. I think this is an inherent problem.

I've also had problems with large databases, dumpdata, loaddata and the like.

You have two choices.

  1. Stop trying to use south and write your own ORM migration. You can have multiple database definitions in your settings. Create "old" and "new". Write your own one-time migrator from the old database to the new database. Once that's tested and works, run it one final time and then switch the database definitions and restart Django.

  2. Ditch south and the ORM and write your own SQL migration. Use raw SQL to copy data out of the old structure to the new structure. Debug separately. When it's good, run it one final time and then switch your setting and restart Django.

It's not that south or the ORM are particularly bad. But, for bulk processing in large databases, they cache too much in memory.


Or, what happens if you create a raw query in situ which implements a rudimentary resultset size limit?

a la: https://docs.djangoproject.com/en/1.3/topics/db/sql/#index-lookups

while min < rowcount:
  min += 500
  max = min + 500
  articles = Article.objects.raw('SELECT * from article where id > %s and id < %s' % (min, max))
  for old_article in articles:
    # create the new article
    article.save()


orm.Article.objects.iterator()

Does that run the whole query and save the result in memory? Or fetch rows from the database one at a time?

I'm guessing it does it all at once. See if you can replace that loop with a database cursor that pulls the data in an incremental fashion:

eg: http://docs.python.org/library/sqlite3.html#sqlite3.Cursor.fetchmany

db = blah.connect("host='%s' dbname='%s' user='%s' password='%s'" % ...
new, old = db.cursor(), db.cursor()
old.execute("""
    SELECT  *
    FROM    whatever
""")
for row in old.fetchmany(size=500):
    (col1, col2, col3...) = row
    new = db.cursor()
    new.execute("""
        INSERT INTO yourtable (
            col1, col2, col3...)
        VALUES (
            %s, %s, %s, %s, %s)
        """,(col1, col2, col3,...))
new.close()
old.close()

It will be slow. I pulled this from a standalone migration script of mine so ymmv.

fetchmany is standard (PEP249). I've not done exactly what you're looking for so there's a little work still to do from this sample: I've not looped over the loop - to get sets of 500 till done - so you'll need to work that out for yourself.


If you don't need full access to the objects, you can always use a combo of only and values or values_list on your queryset. That should help reduce the memory requirements significantly, but I'm not sure whether it will be enough.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜