
Multi processing subprocess

I'm new to the subprocess module in Python; currently my implementation is not multi-processed.

import subprocess, shlex

def forcedParsing(fname):
    cmd = 'strings "%s"' % (fname)
    #print cmd
    args = shlex.split(cmd)
    try:
        sp = subprocess.Popen(args, shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = sp.communicate()
    except OSError as e:
        print "Error no %s  Message %s" % (e.errno, e.strerror)
        return None

    if sp.returncode == 0:
        #print "Processed %s" % fname
        return out

res=[]
for f in file_list: res.append(forcedParsing(f))

my questions:

  1. Is sp.communicate a good way to go? Should I use poll?

    If I use poll, I need a separate process which monitors whether the process has finished, right?

  2. Should I fork at the for loop?


1) subprocess.communicate() seems the right option for what you are trying to do, and you don't need to poll the process: communicate() returns only when it has finished.

2) You mean forking to parallelize the work? Take a look at multiprocessing (Python >= 2.6). Running parallel processes using subprocess is of course possible, but it's quite a bit of work: you cannot just call communicate(), which is blocking.
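For instance, a minimal sketch of that approach (not code from the original answer) could look like this, assuming forcedParsing and file_list are defined as in the question:

from multiprocessing import Pool

if __name__ == "__main__":
    # One worker process per CPU core by default; map() blocks until
    # forcedParsing() has been run on every file in file_list.
    pool = Pool()
    res = pool.map(forcedParsing, file_list)
    pool.close()
    pool.join()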

About your code:

cmd = 'strings "%s"' % (fname)
args = shlex.split(cmd)

Why not simply this?

args = ["strings", fname]

As for this ugly pattern:

res=[]
for f in file_list: res.append(forcedParsing(f))

You should use list comprehensions whenever possible:

res = [forcedParsing(f) for f in file_list]


About question 2: forking at the for loop will mostly speed things up if the script is supposed to run on a system with multiple cores/processors. It will consume more memory, though, and will stress I/O harder. There will be a sweet spot somewhere that depends on the number of files in file_list, but only benchmarking on a realistic target system can tell you where it is. If you find that number, you could add an if len(file_list) > <your number>: check [Edit: rather, as @tokland says, via multiprocessing if it's available on your Python version (2.6+)] that chooses the most efficient strategy on a per-job basis.
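One hedged way to express that per-job choice is sketched below; PARALLEL_THRESHOLD is purely a placeholder value you would determine by benchmarking:

from multiprocessing import Pool

PARALLEL_THRESHOLD = 50  # placeholder; find the real sweet spot by benchmarking

def parse_all(file_list):
    # Parallelize only when there are enough files to make the extra
    # memory and I/O pressure worthwhile; otherwise stay serial.
    if len(file_list) > PARALLEL_THRESHOLD:
        pool = Pool()
        try:
            return pool.map(forcedParsing, file_list)
        finally:
            pool.close()
            pool.join()
    return [forcedParsing(f) for f in file_list]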

Read about Python profiling here: http://docs.python.org/library/profile.html
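For a quick start, the standard library's cProfile can also be driven from inside the script; a tiny illustrative example (assuming forcedParsing and file_list live in your main module):

import cProfile

# Profiles the whole serial run and prints per-function timing statistics.
cProfile.run('[forcedParsing(f) for f in file_list]')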

If you're on Linux, you can also run time: http://linuxmanpages.com/man1/time.1.php


There are several warnings in the subprocess documentation that advise you to use communicate() to avoid problems with processes blocking, so it would be a good idea to use that.

