
Locking output file for shell script invoked multiple times in parallel

I have close to a million files over which I want to run a shell script and append the result to a single file.

For example, suppose I just want to run wc on the files. So that it runs fast I can parallelize it with xargs, but I do not want the scripts to step over each other when writing the output. It is probably better to write to a few separate files rather than one and then cat them later, but I still want the number of such temporary output files to be significantly smaller than the number of input files. Is there a way to get the kind of locking I want, or is it always ensured by default?
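Here is roughly what I have in mind, as a sketch (the /data/files path, counts.txt and the batch sizes are placeholders):

    # Each xargs batch writes to its own part file, named after the PID of the
    # sh that runs it, so parallel workers never touch the same output file.
    outdir=$(mktemp -d) || exit 1

    find /data/files -type f -print0 |
        xargs -0 -n 1000 -P 8 sh -c 'wc "$@" >> "$0/part.$$"' "$outdir"

    # With -n 1000, a million inputs leave about a thousand part files,
    # which are then merged and removed in one go.
    cat "$outdir"/part.* > counts.txt
    rm -rf "$outdir"

That is workable, but the bookkeeping of part files is exactly what I would like to avoid.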

Is there any utility that will recursively cat files together, two at a time, in parallel?

I can write a script to do that, but then I have to deal with the temporaries and clean them up, so I was wondering whether there is a utility that already does this.
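For reference, the sort of script I mean would look something like this (only a sketch; merge_tree.sh and the file naming are made up), and it is precisely this temporary-file handling that I would rather not maintain:

    #!/bin/sh
    # Hypothetical merge_tree.sh: repeatedly cat pairs of partial files in
    # parallel until a single file remains, printed to stdout.
    set -eu
    scratch=$(mktemp -d) || exit 1
    trap 'rm -rf "$scratch"' EXIT        # remove temporaries on any exit

    cp -- "$@" "$scratch"/               # work on copies, keep the inputs

    round=0
    while [ "$(ls "$scratch" | wc -l)" -gt 1 ]; do
        round=$((round + 1))
        i=0
        prev=""
        for f in "$scratch"/*; do
            if [ -z "$prev" ]; then
                prev=$f                  # first half of the next pair
            else
                i=$((i + 1))
                # merge the pair in the background, then drop the pieces
                { cat "$prev" "$f" > "$scratch/merged.$round.$i" &&
                  rm -f -- "$prev" "$f"; } &
                prev=""
            fi
        done
        wait                             # finish this round before the next
    done

    cat "$scratch"/*                     # the single remaining file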


GNU parallel claims that it:

makes sure output from the commands is the same output as you would get had you run the commands sequentially

If that's the case, then I presume it should be safe to simply pipe the output to your file and let parallel handle the intermediate data.

Use the -k option to maintain the order of the output.
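For example (untested on my end, just going by the manual; /data/files and counts.txt stand in for your paths):

    # parallel buffers each job's output and releases it only once the job
    # finishes, so appending everything to a single file should be safe.
    # -0 matches find's -print0 and -k keeps the output in input order.
    find /data/files -type f -print0 | parallel -0 -k wc > counts.txt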

Update: (non-Perl solution)

Another alternative would be prll, which is implemented as shell functions with some C extensions. It is less feature-rich than GNU parallel but should do the job for basic use cases.

The feature listing claims:

Does internal buffering and locking to prevent mangling/interleaving of output from separate jobs.

so it should meet your needs as long as the order of the output is not important.

However, note the following statement on that page:

prll generates a lot of status information on STDERR which makes it harder to use the STDERR output of the job directly as input for another program.
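Going by its examples, usage would be roughly the following (again untested; the paths are placeholders), with STDERR redirected so the status chatter stays out of your results:

    # -s runs the given snippet once per argument, which the snippet sees as $1.
    # Note that a glob over a million files may exceed the argument length
    # limit, in which case the input has to be fed in smaller chunks.
    prll -s 'wc "$1"' /data/files/* > counts.txt 2> prll.log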


Disclaimer: I've tried neither of the tools and am merely quoting from their respective docs.

