Efficient data migration on a large django table
I need to add a new column to a large (5m row) django table. I have a south schemamigration
that creates the new column. Now I'm writing a datamigration
script to populate the new column. It looks like this. (If you're not familiar with south migrations, just ignore the orm. prefix on the model name.)
print "Migrating %s articles." % orm.Article.objects.count()
cnt = 0
for article in orm.Article.objects.iterator():
if cnt % 500 == 0:
print " %s done so far" % cnt
# article.newfield = calculate_newfield(article)
article.save()
cnt += 1
I switched from objects.all() to objects.iterator() to reduce memory requirements, but something is still chewing up vast amounts of memory when I run this script. Even with the actually useful line commented out as above, the script grows to 10+ GB of RAM before getting very far through the table, and I give up on it.
Seems like something is holding on to these objects in memory. How can I run this so it's not a memory hog?
FWIW, I'm using python 2.6, django 1.2.1, south 0.7.2, mysql 5.1.
Ensure settings.DEBUG is set to False. DEBUG=True fills memory, especially with database-intensive operations, since it stores every query sent to the RDBMS within a view. With Django 1.8 out, this should no longer be necessary, since a hardcoded maximum of 9000 queries is now stored instead of an unbounded number as before.
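If for some reason the script must run with DEBUG=True, one workaround on older Django versions is to clear the accumulated query log periodically with django.db.reset_queries(). This is my sketch, not part of this answer, and it simply reuses the loop from the question:

from django import db

cnt = 0
for article in orm.Article.objects.iterator():
    article.newfield = calculate_newfield(article)
    article.save()
    cnt += 1
    if cnt % 500 == 0:
        # Drop the per-connection list of logged queries that DEBUG=True keeps growing.
        db.reset_queries()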
Welcome to Django's ORM. I think this is an inherent problem.
I've also had problems with large databases, dumpdata, loaddata and the like.
You have two choices.
Stop trying to use south and write your own ORM migration. You can have multiple database definitions in your settings. Create "old" and "new". Write your own one-time migrator from the old database to the new database. Once that's tested and works, run it one final time and then switch the database definitions and restart Django.
Ditch south and the ORM and write your own SQL migration. Use raw SQL to copy data from the old structure into the new structure. Debug it separately. When it's good, run it one final time, then switch your settings and restart Django.
It's not that south or the ORM are particularly bad. But, for bulk processing in large databases, they cache too much in memory.
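Here is a rough sketch of the raw-SQL route, not part of the original answer: it assumes MySQLdb, an illustrative app_article table whose new column was already added by the schemamigration, and invented connection details; calculate_newfield stands in for the question's per-row logic (here taking just the old value).

import MySQLdb

# Hypothetical connection details; adjust to match your settings.
conn = MySQLdb.connect(host="localhost", user="dbuser", passwd="secret", db="mydb")
cur = conn.cursor()

batch = 500
last_id = 0
while True:
    # Pull one batch of rows at a time so nothing large sits in memory.
    cur.execute(
        "SELECT id, oldfield FROM app_article WHERE id > %s ORDER BY id LIMIT %s",
        (last_id, batch),
    )
    rows = cur.fetchall()
    if not rows:
        break
    for row_id, oldfield in rows:
        newvalue = calculate_newfield(oldfield)
        cur.execute(
            "UPDATE app_article SET newfield = %s WHERE id = %s",
            (newvalue, row_id),
        )
    conn.commit()
    last_id = rows[-1][0]

cur.close()
conn.close()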
Or, what happens if you create a raw query in situ which implements a rudimentary resultset size limit?
a la: https://docs.djangoproject.com/en/1.3/topics/db/sql/#index-lookups
while min < rowcount:
min += 500
max = min + 500
articles = Article.objects.raw('SELECT * from article where id > %s and id < %s' % (min, max))
for old_article in articles:
# create the new article
article.save()
Does orm.Article.objects.iterator() run the whole query and save the result in memory, or does it fetch rows from the database one at a time?
I'm guessing it does it all at once. See if you can replace that loop with a database cursor that pulls the data in an incremental fashion:
eg: http://docs.python.org/library/sqlite3.html#sqlite3.Cursor.fetchmany
db = blah.connect("host='%s' dbname='%s' user='%s' password='%s'" % ...)
new, old = db.cursor(), db.cursor()
old.execute("""
    SELECT *
    FROM whatever
""")
for row in old.fetchmany(size=500):
    (col1, col2, col3) = row          # unpack whichever columns you selected
    new.execute("""
        INSERT INTO yourtable (
            col1, col2, col3)
        VALUES (
            %s, %s, %s)
    """, (col1, col2, col3))
db.commit()                           # commit the batch unless autocommit is on
new.close()
old.close()
It will be slow. I pulled this from a standalone migration script of mine, so ymmv.
fetchmany is standard (PEP 249). I've not done exactly what you're looking for, so there's still a little work to do with this sample: I haven't looped over the loop to keep fetching sets of 500 until done, so you'll need to work that out for yourself.
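For completeness, a minimal sketch of that outer loop, assuming the db, old and new cursors from the snippet above and placeholder column names; this is an addition, not part of the original answer:

while True:
    rows = old.fetchmany(500)       # pull the next batch of up to 500 rows
    if not rows:                    # fetchmany returns an empty sequence when exhausted
        break
    for row in rows:
        new.execute(
            "INSERT INTO yourtable (col1, col2, col3) VALUES (%s, %s, %s)", row)
    db.commit()                     # commit each batch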
If you don't need full access to the objects, you can always use only, values, or values_list on your queryset. That should help reduce the memory requirements significantly, but I'm not sure whether it will be enough.
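A rough sketch of that idea, assuming (purely for illustration) that the new column can be computed from a single existing source_field column; the field name and the calculate_newfield_from helper are hypothetical:

for pk, source_field in orm.Article.objects.values_list('id', 'source_field').iterator():
    # Only two columns per row are pulled back, not full model instances.
    orm.Article.objects.filter(pk=pk).update(newfield=calculate_newfield_from(source_field))

Each row then costs one extra UPDATE query, but the Python process never has to hold whole Article objects.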