Multithreaded thumbnail generation in Python
I'd like to recurse through a directory of images and generate thumbnails for each image. I have 12 usable cores on my machine. What's a good way to utilize them? I don't have much experience writing multithreaded applications, so any simple sample code is appreciated. Thanks in advance.
Abstract
Use processes, not threads: the GIL makes Python threads inefficient for CPU-bound work. Two possible solutions for multiprocessing are:
The multiprocessing module
This is preferred if you're using an internal thumbnail maker (e.g., PIL). Simply write a thumbnail-maker function and launch 12 of them in parallel; when one of the processes finishes, run another in its slot.
Adapted from the Python documentation, here's a script that should utilize 12 cores:
from multiprocessing import Process
import os

def info(title):  # for learning purposes; remove once you've got the PID/PPID idea
    print(title)
    print('module:', __name__)
    print('parent process:', os.getppid())
    print('process id:', os.getpid())

def f(name):  # worker function
    info('function f')
    print('hello', name)

if __name__ == '__main__':
    info('main line')
    processes = [Process(target=f, args=('bob-%d' % i,)) for i in range(12)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
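To get the "run another in its slot" behavior, you can keep a list of live Process objects and refill free slots as workers exit. This is only a sketch under assumed names (run_in_slots, target, and items are illustrative, not a standard API):

```python
import time
from multiprocessing import Process

def run_in_slots(target, items, limit=12):
    """Keep at most `limit` worker processes running; whenever one
    finishes, start another in its slot until all items are done."""
    pending = list(items)
    running = []
    while pending or running:
        # drop finished workers to free their slots
        running = [p for p in running if p.is_alive()]
        # fill the free slots with new work
        while pending and len(running) < limit:
            p = Process(target=target, args=(pending.pop(),))
            p.start()
            running.append(p)
        time.sleep(0.05)  # don't busy-wait
```

You'd pass your thumbnail-maker function as `target` and the list of image paths as `items`.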
Addendum: using multiprocessing.Pool()
Following soulman's comment, you can use the provided process pool. I've adapted some code from the multiprocessing manual. Note that you should probably use multiprocessing.cpu_count() instead of 4 to determine the number of worker processes automatically.
from multiprocessing import Pool
import datetime

def f(x):  # your thumbnail-maker function, probably using some module like PIL
    print('%-4d: Started at %s' % (x, datetime.datetime.now()))
    return x * x

if __name__ == '__main__':
    pool = Pool(processes=4)  # start 4 worker processes
    print(pool.map(f, range(25)))  # prints "[0, 1, 4, ..., 576]"
Which gives (note that the printouts are not strictly ordered!):
0 : Started at 2011-04-28 17:25:58.992560
1 : Started at 2011-04-28 17:25:58.992749
4 : Started at 2011-04-28 17:25:58.992829
5 : Started at 2011-04-28 17:25:58.992848
2 : Started at 2011-04-28 17:25:58.992741
3 : Started at 2011-04-28 17:25:58.992877
6 : Started at 2011-04-28 17:25:58.992884
7 : Started at 2011-04-28 17:25:58.992902
10 : Started at 2011-04-28 17:25:58.992998
11 : Started at 2011-04-28 17:25:58.993019
12 : Started at 2011-04-28 17:25:58.993056
13 : Started at 2011-04-28 17:25:58.993074
14 : Started at 2011-04-28 17:25:58.993109
15 : Started at 2011-04-28 17:25:58.993127
8 : Started at 2011-04-28 17:25:58.993025
9 : Started at 2011-04-28 17:25:58.993158
16 : Started at 2011-04-28 17:25:58.993161
17 : Started at 2011-04-28 17:25:58.993179
18 : Started at 2011-04-28 17:25:58.993230
20 : Started at 2011-04-28 17:25:58.993233
19 : Started at 2011-04-28 17:25:58.993249
21 : Started at 2011-04-28 17:25:58.993252
22 : Started at 2011-04-28 17:25:58.993288
24 : Started at 2011-04-28 17:25:58.993297
23 : Started at 2011-04-28 17:25:58.993307
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256,
289, 324, 361, 400, 441, 484, 529, 576]
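Following the cpu_count() suggestion, the pool can be sized to the machine instead of hard-coded. A minimal sketch (square is just a stand-in for your thumbnail maker):

```python
from multiprocessing import Pool, cpu_count

def square(x):  # stand-in for a real thumbnail-maker function
    return x * x

if __name__ == '__main__':
    # size the pool to the machine instead of hard-coding 4
    with Pool(processes=cpu_count()) as pool:
        print(pool.map(square, range(8)))
```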
The subprocess module
The subprocess module is useful for running external processes, and is thus preferred if you plan on using an external thumbnail maker such as ImageMagick's convert. Code example:
import subprocess as sp

processes = [sp.Popen('your-command-here', shell=True,
                      stdout=sp.PIPE, stderr=sp.PIPE) for i in range(12)]
Now iterate over the processes. Whenever one has finished (check with Popen.poll(), which returns None while the process is still running), remove it from the list and start a new process in its place.
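That poll-and-replace loop might look roughly like this (run_commands is an illustrative name, and the command strings are up to you):

```python
import subprocess as sp
import time

def run_commands(commands, limit=12):
    """Run shell commands with at most `limit` of them in flight.
    Illustrative sketch; adapt the command list to your thumbnail tool."""
    pending = list(commands)
    running = []
    while pending or running:
        # Popen.poll() returns None while the process is still running
        running = [p for p in running if p.poll() is None]
        # start new commands in the freed slots
        while pending and len(running) < limit:
            running.append(sp.Popen(pending.pop(), shell=True))
        time.sleep(0.05)  # don't busy-wait
```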
Like others have answered, subprocesses are usually preferable to threads. multiprocessing.Pool makes it easy to use exactly as many worker processes as you want, for instance like this:
import os
from multiprocessing import Pool

def process_file(filepath):
    # if filepath is an image file, resize it (e.g., with PIL)
    pass

def enumerate_files(folder):
    for dirpath, dirnames, filenames in os.walk(folder):
        for fname in filenames:
            yield os.path.join(dirpath, fname)

if __name__ == '__main__':
    pool = Pool(12)  # or omit the parameter to use the CPU count
    # use pool.map() only for its side effects; ignore the return value
    pool.map(process_file, enumerate_files('.'), chunksize=1)
The chunksize=1 parameter makes sense if each file operation is relatively slow compared to communicating with each subprocess.
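For the "if filepath is an image file" check, one cheap approach is filtering by extension. IMAGE_EXTS and is_image below are hypothetical helpers, and an extension check is only a heuristic (a robust version would sniff file headers):

```python
import os

IMAGE_EXTS = {'.jpg', '.jpeg', '.png', '.gif', '.bmp'}  # extend as needed

def is_image(filepath):
    """Cheap check by file extension, case-insensitive."""
    return os.path.splitext(filepath)[1].lower() in IMAGE_EXTS
```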
Don't go with threads; they are more complicated than what you need. Instead, use the subprocess module to spawn separate processes that work through each directory.
So you will have a primary program that generates a list of files, then pops each file off the list and feeds it to a subprocess. Each subprocess would be a simple Python program that generates a thumbnail from an input image. Some simple logic to keep the number of spawned processes within a limit, say 11, would keep you from fork-bombing your machine.
This lets the OS handle all of those niggling details of who runs where and so on.