Efficient data migration on a large django table
I need to add a new column to a large (5m row) django table. I have a south schemamigration
that creates the new column. Now I'm writing a datamigration
script to populate the new column. It looks like this. (If you're not familiar with south migrations, just ignore the orm. prefix on the model name.)
print "Migrating %s articles." % orm.Article.objects.count()
cnt = 0
for article in orm.Article.objects.iterator():
if cnt % 500 == 0:
print " %s done so far" % cnt
# article.newfield = calculate_newfield(article)
article.save()
cnt += 1
I switched from objects.all() to objects.iterator() to reduce memory requirements, but something is still chewing up vast amounts of memory when I run this script. Even with the actually useful line commented out as above, the script grows to 10+ GB of RAM before getting very far through the table, and I give up on it.
Seems like something is holding on to these objects in memory. How can I run this so it's not a memory hog?
FWIW, I'm using python 2.6, django 1.2.1, south 0.7.2, mysql 5.1.
Ensure settings.DEBUG is set to False. DEBUG=True fills memory, especially with database-intensive operations, since it stores every query sent to the RDBMS within a view. With Django 1.8 out, this should no longer be necessary, since a hardcoded maximum of 9000 queries is now stored instead of an unbounded number as before.
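If for some reason the script must run with DEBUG=True, one workaround on older Django versions is to clear the accumulated query log periodically with django.db.reset_queries(). This is my sketch, not part of this answer, and it simply reuses the loop from the question:

from django import db

cnt = 0
for article in orm.Article.objects.iterator():
    article.newfield = calculate_newfield(article)
    article.save()
    cnt += 1
    if cnt % 500 == 0:
        # Drop the per-connection list of logged queries that DEBUG=True keeps growing.
        db.reset_queries()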
Welcome to Django's ORM. I think this is an inherent problem.
I've also had problems with large databases, dumpdata, loaddata and the like.
You have two choices.
Stop trying to use south and write your own ORM migration. You can have multiple database definitions in your settings. Create "old" and "new". Write your own one-time migrator from the old database to the new database. Once that's tested and works, run it one final time and then switch the database definitions and restart Django.
Ditch south and the ORM and write your own SQL migration. Use raw SQL to copy data from the old structure into the new structure. Debug it separately. When it's good, run it one final time, then switch your settings and restart Django.
It's not that south or the ORM are particularly bad. But, for bulk processing in large databases, they cache too much in memory.
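Here is a rough sketch of the raw-SQL route, not part of the original answer: it assumes MySQLdb, an illustrative app_article table whose new column was already added by the schemamigration, and invented connection details; calculate_newfield stands in for the question's per-row logic (here taking just the old value).

import MySQLdb

# Hypothetical connection details; adjust to match your settings.
conn = MySQLdb.connect(host="localhost", user="dbuser", passwd="secret", db="mydb")
cur = conn.cursor()

batch = 500
last_id = 0
while True:
    # Pull one batch of rows at a time so nothing large sits in memory.
    cur.execute(
        "SELECT id, oldfield FROM app_article WHERE id > %s ORDER BY id LIMIT %s",
        (last_id, batch),
    )
    rows = cur.fetchall()
    if not rows:
        break
    for row_id, oldfield in rows:
        newvalue = calculate_newfield(oldfield)
        cur.execute(
            "UPDATE app_article SET newfield = %s WHERE id = %s",
            (newvalue, row_id),
        )
    conn.commit()
    last_id = rows[-1][0]

cur.close()
conn.close()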
Or, what happens if you create a raw query in situ which implements a rudimentary resultset size limit?
a la: https://docs.djangoproject.com/en/1.3/topics/db/sql/#index-lookups
while min < rowcount:
min += 500
max = min + 500
articles = Article.objects.raw('SELECT * from article where id > %s and id < %s' % (min, max))
for old_article in articles:
# create the new article
article.save()
Does orm.Article.objects.iterator() run the whole query and save the result in memory, or does it fetch rows from the database one at a time?
I'm guessing it does it all at once. See if you can replace that loop with a database cursor that pulls the data in an incremental fashion:
eg: http://docs.python.org/library/sqlite3.html#sqlite3.Cursor.fetchmany
db = blah.connect("host='%s' dbname='%s' user='%s' password='%s'" % ...)
new, old = db.cursor(), db.cursor()
old.execute("""
    SELECT *
    FROM whatever
""")
for row in old.fetchmany(size=500):
    (col1, col2, col3) = row          # unpack whichever columns you selected
    new.execute("""
        INSERT INTO yourtable (
            col1, col2, col3)
        VALUES (
            %s, %s, %s)
    """, (col1, col2, col3))
db.commit()                           # commit the batch unless autocommit is on
new.close()
old.close()
It will be slow. I pulled this from a standalone migration script of mine, so ymmv.
fetchmany is standard (PEP 249). I've not done exactly what you're looking for, so there's still a little work to do with this sample: I haven't looped over the loop to keep fetching sets of 500 until done, so you'll need to work that out for yourself.
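For completeness, a minimal sketch of that outer loop, assuming the db, old and new cursors from the snippet above and placeholder column names; this is an addition, not part of the original answer:

while True:
    rows = old.fetchmany(500)       # pull the next batch of up to 500 rows
    if not rows:                    # fetchmany returns an empty sequence when exhausted
        break
    for row in rows:
        new.execute(
            "INSERT INTO yourtable (col1, col2, col3) VALUES (%s, %s, %s)", row)
    db.commit()                     # commit each batch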
If you don't need full access to the objects, you can always use only, values, or values_list on your queryset. That should help reduce the memory requirements significantly, but I'm not sure whether it will be enough.
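A rough sketch of that idea, assuming (purely for illustration) that the new column can be computed from a single existing source_field column; the field name and the calculate_newfield_from helper are hypothetical:

for pk, source_field in orm.Article.objects.values_list('id', 'source_field').iterator():
    # Only two columns per row are pulled back, not full model instances.
    orm.Article.objects.filter(pk=pk).update(newfield=calculate_newfield_from(source_field))

Each row then costs one extra UPDATE query, but the Python process never has to hold whole Article objects.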