
I suspect Popen of timing out silently

I'm having some difficulties with my scripts. Their purpose is to launch one or several OpenVZ containers to execute some tests. Those tests can be very long (usually about 3 hours).

The first script works this way: after sorting the queue members to launch, it does:

subprocess.Popen(QUEUE_EXECUTER % queue['queue_id'], shell=True)

Where "QUEUE_EXECUTER % queue['queue_id']" is the complete command to run. The queue_executer script goes this way:

# Launching install
cmd = queue['cmd_install']
report_install = open(queue['report_install'], 'a')
process_install = subprocess.Popen(cmd, shell=True, stdout=report_install, stderr=subprocess.STDOUT)
process_install.wait()

# Launching test
logger.debug('Launching test')
report_test = open(queue['report_test'], 'a')
cmd = queue['cmd_test']
process_test = subprocess.Popen(cmd, shell=True, stdout=report_test, stderr=subprocess.STDOUT)
process_test.wait()

It works quite well, but sometimes, and more recently most of the time, the execution stops. There is no error in the logs or anywhere else. The report file shows that it stopped right in the middle of writing a line (which, I believe, is because the file isn't correctly closed on the Python side). On the host side the OOM killer doesn't seem to do anything, and I've searched through the host's logs without finding anything either.

The two "cmd" launched above are shell scripts which basically set up a VZ container and execute a test program on it.

So my big question is: am I missing something that would cause the scripts to stop on the Python side?

Thanks.

EDIT: Some additional information.

The command which fails is always the second one. Here are two example values of the commands I try to execute: /path/vzspawncluster.sh /tmp/file web --tarball /services/pkgs/etch/releases/archive.tar.gz --create and /path/vzlaunch.sh 172 -b trunk --args "-a -v -s --time --cluster --sql=qa3 --queue=223 --html --mail=adress@mail.com"

The vzlaunch script launches a Python script on an OpenVZ container with vzctl enter ID /path/script.py, where ID is the container ID and /path/script.py is the script on the container.

report_install and report_test are files located on a different machine, accessed through an NFS share. That should not matter, but as I really don't know what is going on when it fails, I note it anyway.

When it fails, the process on the container dies. It does not remain in a zombie state or anything; it's just dead. Although the process on the container fails, the main process (the one that launches them all) continues as if everything were fine.

Some more info: I tried the buffer-flushing approach pointed out by smci, but my log file keeps being cut off right in the middle of a line:

[18:55:27][Scripteo]       Create process '/QA/valideo.trunk/tests/756/test.py -i 10.1.11.122 --report --verbose --name 756 --...
[18:56:35][Scripteo]       Create process '/QA/valideo.trunk/tests/762/test.py -i 10.1.11.122 --report --verbose --name 762 --...
[18:57:56][Scripteo]       Create process '/QA/valideo.trunk/tests/764/test.py -i 10.1.11.122 --report --verbose --name 764 --...
[18:59:27][Scripteo]       Create process '/QA/valideo.trunk/tests/789/test.py -i 10.1.11.122 --report --verbose --name 789 --...
[19:00:44][Scripteo]       Create process '/QA/valideo.trunk/tests/866/test.py -i 10.1.11.122 --report --verbose --name 866 --...
[19:02:27][Scripteo]       Create process '/QA/valideo.trunk/tests/867/test.py -i 10.1.11.122 --report --verbose --name 867 --...
[19:04:13][Scripteo]       Create process '/QA/valideo.trunk/tests/874/t


Your intent is first to run process_install until it finishes, then run process_test? (Sequentially, not multiprocessing, right?) Which command do you suspect of timing out?

Please paste the actual values of queue['cmd_install'] and queue['cmd_test'].

(Does either of those commands have a trailing '&' or redirects?)

Here are my debugging suggestions:

  • (I don't know OpenVZ, but I assume you've checked the logs and whether it allows running at-exit commands)

  • Are you running on UNIX? If so, you could play around with the commands to run cmd in the background and also run a loop to generate output, e.g. a while(1) that touches a sentinel file and then sleeps 10s. Or you could run cmd; touch donesentinel.

  • Try adding a polling loop that calls poll() on each Popen object at a regular interval, instead of wait().

  • Alternatively, print Popen.pid after it launches, then externally check or poll that the process is still alive (e.g. with UNIX top -p).

  • If your process generates a lot of output, did you note the caveat on Popen.wait()? Warning: This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.

  • If you suspect that is happening, redirect either/both of stdout, stderr to os.devnull and see whether your results differ. Or see this buffer-flushing approach.
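
To illustrate the poll() and pid suggestions above, here is a minimal sketch, not your actual code. The names run_and_watch and describe are hypothetical helpers, and the 10-second default interval is an arbitrary assumption. It opens the report in a with block so the file is closed on the Python side, prints the pid so you can watch it externally, and turns the return code into a readable message.

```python
import subprocess
import time


def run_and_watch(cmd, report_path, interval=10.0):
    """Run cmd through the shell, append its output to report_path,
    and poll until it exits. Returns the process return code.

    Hypothetical helper: the name and the default interval are assumptions.
    """
    # The with block guarantees the report file is flushed and closed.
    with open(report_path, 'a') as report:
        process = subprocess.Popen(cmd, shell=True, stdout=report,
                                   stderr=subprocess.STDOUT)
        # Print the pid so the process can be checked from outside.
        print('started pid %d' % process.pid)
        # poll() returns None while the child is still running.
        while process.poll() is None:
            time.sleep(interval)
    return process.returncode


def describe(returncode):
    """Turn a Popen return code into a short human-readable message."""
    # On POSIX, Popen reports death by signal N as the return code -N.
    if returncode < 0:
        return 'killed by signal %d' % -returncode
    return 'exited with status %d' % returncode
```

You would call it as describe(run_and_watch(queue['cmd_test'], queue['report_test'])) and log the result. Because of shell=True, the return code is that of the shell: a negative value means the shell process itself was killed by a signal, while a command killed inside the shell is commonly reported by the shell as 128 plus the signal number.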
