
Multithreaded thumbnail generation in Python

I'd like to recurse through a directory of images and generate thumbnails for each image. I have 12 usable cores on my machine. What's a good way to utilize them? I don't have much experience writing multithreaded applications, so any simple sample code is appreciated. Thanks in advance.


Abstract

Use processes, not threads: the GIL (Global Interpreter Lock) prevents Python threads from running CPU-bound work in parallel, so threads won't help here. Two possible approaches to multiprocessing are:

The multiprocessing module

This is preferred if you're using an in-process thumbnail maker (e.g., PIL). Simply write a thumbnail-maker function and launch 12 worker processes in parallel. When one of the processes finishes, start another in its slot (a sketch of this slot-refilling appears after the script below).

Adapted from the Python documentation, here's a script that should utilize 12 cores:

from multiprocessing import Process
import os

def info(title):  # For learning purposes; remove once the PID/PPID idea is clear
    print title
    print 'module:', __name__
    print 'parent process:', os.getppid()
    print 'process id:', os.getpid()

def f(name):      # Worker function
    info('function f')
    print 'hello', name

if __name__ == '__main__':
    info('main line')
    processes = [Process(target=f, args=('bob-%d' % i,)) for i in range(12)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
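
To tie this to thumbnails, here is a minimal sketch of the "run another in its slot" idea, assuming PIL is installed. The make_thumbnail and run_in_slots names and the (128, 128) size are illustrative choices, not from the documentation example:

import os
import time
from multiprocessing import Process
from PIL import Image

def make_thumbnail(filepath):          # Hypothetical worker function
    try:
        img = Image.open(filepath)
    except IOError:
        return                         # Not an image; skip it
    img.thumbnail((128, 128), Image.ANTIALIAS)
    img.save(filepath + '.thumb.jpg', 'JPEG')

def run_in_slots(filepaths, slots=12):
    running = []
    for path in filepaths:
        # Drop finished workers; wait until a slot is free
        running = [p for p in running if p.is_alive()]
        while len(running) >= slots:
            time.sleep(0.05)
            running = [p for p in running if p.is_alive()]
        p = Process(target=make_thumbnail, args=(path,))
        p.start()
        running.append(p)
    for p in running:                  # Wait for the last workers to finish
        p.join()

if __name__ == '__main__':
    run_in_slots([f for f in os.listdir('.') if f.endswith('.jpg')])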

Addendum: Using multiprocessing.Pool

Following soulman's comment, you can use the provided process pool.

I've adapted some code from the multiprocessing manual. Note that you probably should use multiprocessing.cpu_count() instead of 4 to automatically determine the number of CPUs.

from multiprocessing import Pool
import datetime

def f(x):  # Your thumbnail-maker function, probably using a module like PIL
    print '%-4d: Started at %s' % (x, datetime.datetime.now())
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)              # start 4 worker processes
    print pool.map(f, range(25))          # prints "[0, 1, 4,..., 81]"

Which gives (note that the printouts are not strictly ordered!):

0   : Started at 2011-04-28 17:25:58.992560
1   : Started at 2011-04-28 17:25:58.992749
4   : Started at 2011-04-28 17:25:58.992829
5   : Started at 2011-04-28 17:25:58.992848
2   : Started at 2011-04-28 17:25:58.992741
3   : Started at 2011-04-28 17:25:58.992877
6   : Started at 2011-04-28 17:25:58.992884
7   : Started at 2011-04-28 17:25:58.992902
10  : Started at 2011-04-28 17:25:58.992998
11  : Started at 2011-04-28 17:25:58.993019
12  : Started at 2011-04-28 17:25:58.993056
13  : Started at 2011-04-28 17:25:58.993074
14  : Started at 2011-04-28 17:25:58.993109
15  : Started at 2011-04-28 17:25:58.993127
8   : Started at 2011-04-28 17:25:58.993025
9   : Started at 2011-04-28 17:25:58.993158
16  : Started at 2011-04-28 17:25:58.993161
17  : Started at 2011-04-28 17:25:58.993179
18  : Started at 2011-04-28 17:25:58.993230
20  : Started at 2011-04-28 17:25:58.993233
19  : Started at 2011-04-28 17:25:58.993249
21  : Started at 2011-04-28 17:25:58.993252
22  : Started at 2011-04-28 17:25:58.993288
24  : Started at 2011-04-28 17:25:58.993297
23  : Started at 2011-04-28 17:25:58.993307
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 
 289, 324, 361, 400, 441, 484, 529, 576]

The subprocess module

The subprocess module is useful for running external processes, and is thus preferred if you plan on using an external thumbnail maker such as ImageMagick's convert. Code example:

import subprocess as sp

processes=[sp.Popen('your-command-here', shell=True, 
                    stdout=sp.PIPE, stderr=sp.PIPE) for i in range(12)]

Now, iterate over processes. Whenever a process has finished (check with Popen.poll(), which returns None while the process is still running), remove it from the list and start a new process in its place.
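
Here's a minimal sketch of that poll-and-refill loop, assuming ImageMagick's convert is on your PATH. The convert_cmd helper, the 128x128 geometry, and the 12-slot limit are illustrative, not part of the original answer:

import subprocess as sp
import time

def convert_cmd(src):
    # ImageMagick convert; the -thumbnail geometry is an arbitrary choice
    return ['convert', src, '-thumbnail', '128x128', src + '.thumb.jpg']

def run_all(files, slots=12):
    running = []
    while files or running:
        # poll() returns None while a process is still running
        running = [p for p in running if p.poll() is None]
        while files and len(running) < slots:   # Refill the free slots
            running.append(sp.Popen(convert_cmd(files.pop())))
        time.sleep(0.1)                         # Don't spin at full speed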


As others have answered, subprocesses are usually preferable to threads. multiprocessing.Pool makes it easy to use exactly as many subprocesses as you want, for instance like this:

import os
from multiprocessing import Pool
from PIL import Image

def process_file(filepath):
    # If filepath is an image file, make a thumbnail next to it
    # (the 128x128 size is an arbitrary choice)
    try:
        img = Image.open(filepath)
    except IOError:
        return  # Not an image; skip it
    img.thumbnail((128, 128), Image.ANTIALIAS)
    img.save(filepath + '.thumb.jpg', 'JPEG')

def enumerate_files(folder):
    for dirpath, dirnames, filenames in os.walk(folder):
        for fname in filenames:
            yield os.path.join(dirpath, fname)

if __name__ == '__main__':
    pool = Pool(12) # or omit the parameter to use CPU count
    # use pool.map() only for the side effects, ignore the return value
    pool.map(process_file, enumerate_files('.'), chunksize=1)

The chunksize=1 parameter makes sense if each file operation is slow relative to the cost of communicating with a subprocess.


Don't go with threads; they are too complicated for what you want. Instead, use the subprocess module to spawn separate processes that work through the directory.

So you will have a primary program that generates a list of files, then pops each file off the list and feeds it to a subprocess. The subprocess would be a simple Python program that generates a thumbnail from an input image (a sketch of such a worker script follows below). Some simple logic to cap the number of spawned processes, say at 11, would keep you from fork-bombing your machine.

This lets the OS handle all of those niggling details of who runs where, and so on.
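
As a rough sketch, the worker half of that design could be a standalone script the primary program invokes once per file; the PIL calls and the 128x128 size are assumptions, not part of the answer:

# thumbnail.py -- the primary program runs: python thumbnail.py <image-file>
import sys
from PIL import Image

if __name__ == '__main__':
    src = sys.argv[1]
    img = Image.open(src)
    img.thumbnail((128, 128), Image.ANTIALIAS)
    img.save(src + '.thumb.jpg', 'JPEG')

The primary program then keeps at most 11 such subprocess.Popen invocations alive at once, using the same poll-and-refill pattern shown earlier.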
