How to update 400k rows on a production MySQL db without killing it
On our production server we need to split 900k images into different dirs and update 400k rows (MySQL, InnoDB engine). I wrote a Python script that goes through the following steps:
- Select a small chunk of data from the db (10 rows)
- Make new dirs
- Copy the files to the created dirs and rename them
- Update the db (there are triggers on update that will load the server)
- Repeat
My code:
import os, shutil

from database import Connection  # database.py from tornado

LIMIT_START_OFFSET = 0
LIMIT_ROW_COUNT = 10
SRC_PATHS = ('/var/www/site/public/upload/images/',)
DST_PATH = '/var/www/site/public/upload/new_images/'

def main():
    offset = LIMIT_START_OFFSET
    while True:
        db = Connection(DB_HOST, DB_NAME, DB_USER, DB_PASSWD)
        db_data = db.query('''
            SELECT id AS news_id, image AS src_filename
            FROM emd_news
            ORDER BY id ASC
            LIMIT %s, %s''', offset, LIMIT_ROW_COUNT)
        offset = offset + LIMIT_ROW_COUNT
        news_images = get_news_images(db_data)  # convert data to an easy-to-use list
        make_dst_dirs(DST_PATH, [i['dst_dirname'] for i in news_images])  # make new dirs
        news_to_update = copy_news_images(SRC_PATHS, DST_PATH, news_images)  # list of moved files
        db.executemany('''
            UPDATE emd_news
            SET image = %s
            WHERE id = %s
            LIMIT 1''', [(i['filename'], i['news_id']) for i in news_to_update])
        db.close()
        if not db_data: break

if __name__ == '__main__':
    main()
It's quite a simple task, but I'm a little nervous about performance.
How can I make this script more efficient?
UPD: In the end I used the original script without any modifications. It took about 5 hours, and it was fast in the beginning and very slow at the end.
One thing I'll add: why do you create a new connection on every loop iteration and then close it?! Open a single connection before the loop and reuse it.
You could also use db.autocommit(False), especially for the UPDATEs, and do a db.commit() every 100 rows or so.
And, like Alin Purcaru said, you should do some benchmarking as well.
Hope this helps :)
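The batched-commit idea above can be sketched as a plain DB-API function. The name `update_in_batches` is illustrative, and the snippet uses sqlite-style `?` placeholders for portability; MySQLdb expects `%s`:

```python
def update_in_batches(conn, rows, batch_size=100):
    """Run the UPDATEs with autocommit off, committing every
    `batch_size` rows so InnoDB flushes once per batch instead
    of once per statement.
    ('?' placeholders here; MySQLdb uses '%s'.)"""
    cur = conn.cursor()
    for n, (filename, news_id) in enumerate(rows, 1):
        cur.execute("UPDATE emd_news SET image = ? WHERE id = ?",
                    (filename, news_id))
        if n % batch_size == 0:
            conn.commit()  # flush a full batch
    conn.commit()          # flush the final partial batch
```

Larger batches mean fewer flushes but also longer-held row locks, so 100 to 1000 rows per commit is a reasonable starting range to benchmark.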
What I recommend:
- Add an isProcessed column to your table.
- Make your script work on a chunk of, say, 1k rows for the first run (of course, select only rows that are not processed).
- Benchmark it.
- Adjust the chunk size if needed.
- Build another script that calls this one at intervals.
Don't forget to add some sleep time in both your scripts!
This will work if your change does not need to be continuous (and I don't think it has to be). If you have to do it all at once, you should take your database offline while the script runs.
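The isProcessed approach above might look like the sketch below. `process_chunk` and the `move_file` callback are hypothetical names, and sqlite-style `?` placeholders stand in for MySQLdb's `%s`:

```python
def process_chunk(conn, move_file, chunk_size=1000):
    """Handle one chunk of unprocessed rows: move each row's file,
    then mark the rows processed in the same transaction.
    Returns the number of rows handled; 0 means everything is done.
    ('?' placeholders here; MySQLdb uses '%s'.)"""
    cur = conn.cursor()
    cur.execute("SELECT id, image FROM emd_news "
                "WHERE isProcessed = 0 ORDER BY id LIMIT ?",
                (chunk_size,))
    rows = cur.fetchall()
    # move_file() relocates the image on disk and returns its new path
    updates = [(move_file(image), news_id) for news_id, image in rows]
    cur.executemany("UPDATE emd_news SET image = ?, isProcessed = 1 "
                    "WHERE id = ?", updates)
    conn.commit()
    return len(rows)
```

Because the flag records progress, the outer script can simply call this in a loop (with a sleep between calls) and safely resume after a crash.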
        db_data = db.query('''
            SELECT id AS news_id, image AS src_filename
            FROM emd_news
            ORDER BY id ASC
            LIMIT %s, %s''', offset, LIMIT_ROW_COUNT)
        # Why is there any code here at all? If there's no data, why proceed?
        if not db_data: break
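Incidentally, the "fast in the beginning and very slow in the end" behaviour is typical of `LIMIT offset, count` paging: MySQL must scan and discard all `offset` rows on every query, so each chunk gets more expensive than the last. Keyset pagination (seeking past the last id seen) keeps every chunk query cheap. A sketch, with `fetch_chunks` a hypothetical name and sqlite-style `?` placeholders in place of MySQLdb's `%s`:

```python
def fetch_chunks(conn, chunk_size=10):
    """Yield rows in chunks via keyset pagination: seek past the
    last id seen instead of using a growing LIMIT offset.
    ('?' placeholders here; MySQLdb uses '%s'.)"""
    last_id = 0
    cur = conn.cursor()
    while True:
        cur.execute("SELECT id, image FROM emd_news "
                    "WHERE id > ? ORDER BY id LIMIT ?",
                    (last_id, chunk_size))
        rows = cur.fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume just after the last row handled
```

This relies on `id` being the primary key (indexed and unique), which also makes the ORDER BY free.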