Does a PBS batch system move multiple serial jobs across nodes?

If I need to run many serial programs "in parallel" (because the problem is simple but time consuming - I need to read in many different data sets for the same program), the solution is simple if I only use one node. All I do is keep submitting serial jobs with an ampersand after each command, e.g. in the job script:

./program1 &
./program2 &
./program3 &
./program4

which will naturally run each serial program on a different processor. This works well on a login server or standalone workstation, and of course for a batch job asking for only one node.
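(A note for completeness: since the batch job ends when the script exits, backgrounding every program and finishing the script with a wait guarantees the job stays alive until all of them are done, regardless of which finishes last. A minimal single-node sketch of that variant:

./program1 &
./program2 &
./program3 &
./program4 &
# block until every backgrounded program has finished,
# so the batch job does not exit before they do
wait
)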

But what if I need to run 110 different instances of the same program to read 110 different data sets? If I submit to multiple nodes (say 14) with a script which submits 110 ./program# commands, will the batch system run each job on a different processor across the different nodes, or will it try to run them all on the same 8-core node?

I have tried using a simple MPI code to read the different data, but various errors result: about 100 of the 110 processes succeed, and the others crash. I have also considered job arrays, but I'm not sure if my system supports them.

I have tested the serial program extensively on individual data sets - there are no runtime errors, and I do not exceed the available memory on each node.


No, PBS won't automatically distribute the jobs among nodes for you. But this is a common thing to want to do, and you have a few options.

  • Easiest and in some ways most advantageous for you is to bunch the tasks into 1-node-sized chunks, and submit those bundles as individual jobs. This will get your jobs started faster; a 1-node job will normally get scheduled faster than a (say) 14-node job, simply because there are more one-node-sized holes in the schedule than 14-node ones. This works particularly well if all the jobs take roughly the same amount of time, because then doing the division is pretty simple.

  • If you do want to do it all in one job (say, to simplify the bookkeeping), you may or may not have access to the pbsdsh command; there's a good discussion of it here. This lets you run a single script on all the processors in your job. You then write a script which queries $PBS_VNODENUM to find out which of the nnodes*ppn tasks it is, and runs the appropriate one (a sketch follows this list).

  • If not pbsdsh, GNU Parallel is another tool which can enormously simplify these tasks. It's like xargs, if you're familiar with that, but runs the commands in parallel, including on multiple nodes. So you'd submit your (say) 14-node job and have the first node run a GNU Parallel script (see the second sketch below). The nice thing is that this will do the scheduling for you even if the jobs are not all of the same length. The advice we give to users on our system for using GNU Parallel for these sorts of things is here. Note that if GNU Parallel isn't installed on your system, and for some reason your sysadmins won't do it, you can set it up in your home directory; it's not a complicated build.
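A minimal sketch of the pbsdsh approach, assuming (hypothetically) that the data sets are named data.0 through data.109 and that the program takes a data file as its argument. The job script launches one copy of a small wrapper on every allocated core:

#!/bin/bash
#PBS -l nodes=14:ppn=8
cd $PBS_O_WORKDIR
# run one copy of the wrapper on each of the 14*8 = 112 allocated cores
pbsdsh $PBS_O_WORKDIR/run_task.sh

and run_task.sh uses $PBS_VNODENUM (which runs from 0 to nnodes*ppn-1) to pick its data set:

#!/bin/bash
cd $PBS_O_WORKDIR
# only 110 of the 112 slots have work to do
if [ "$PBS_VNODENUM" -lt 110 ]; then
    ./program data.$PBS_VNODENUM
fi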
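And a sketch of the GNU Parallel route, again assuming the hypothetical data.0 .. data.109 naming, a shared filesystem across nodes, and that parallel is on the PATH; the script runs on the first node of the job and farms tasks out to the other nodes over ssh:

#!/bin/bash
#PBS -l nodes=14:ppn=8
cd $PBS_O_WORKDIR

# PBS_NODEFILE lists each node once per core; reduce it to one line per node
sort -u $PBS_NODEFILE > nodelist.txt

# run ./program on each data set, at most 8 tasks per node at a time;
# GNU Parallel hands out new tasks as old ones finish
parallel --jobs 8 --sshloginfile nodelist.txt --workdir $PBS_O_WORKDIR \
    ./program data.{} ::: $(seq 0 109)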


You should consider job arrays.

Briefly, you insert #PBS -t 0-109 in your shell script (the range 0-109 can be any integer range you want, but you stated you had 110 data sets) and Torque will:

  • run 110 instances of your script, allocating each with the resources you specify (in the script with #PBS tags or as arguments when you submit).
  • assign a unique integer from 0 to 109 to the environment variable PBS_ARRAYID for each job.

Assuming you have access to environment variables within the code, you can just tell each job to run on data set number PBS_ARRAYID.
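For concreteness, a minimal job-array script along those lines (the data.N naming is just an assumption for illustration):

#!/bin/bash
#PBS -t 0-109
#PBS -l nodes=1:ppn=1
cd $PBS_O_WORKDIR

# each of the 110 array instances receives its own PBS_ARRAYID (0..109)
# and uses it to select one data set
./program data.$PBS_ARRAYID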
