"find" and "ls" with GNU parallel

2023-04-09 07:58

I'm trying to use GNU parallel to post a lot of files to a web server. In my directory, I have some files:

file1.xml
file2.xml

and I have a shell script that looks like this:

#!/usr/bin/env bash

CMD="curl -X POST -d@$1 http://server/path"

eval $CMD

There's some other stuff in the script, but this was the simplest example. I tried to execute the following command:

ls | parallel -j2 script.sh {}

Which is what the GNU parallel pages show as the "normal" way to operate on files in a directory. This seems to pass the name of the file into my script, but curl complains that it can't load the data file passed in. However, if I do:

find . -name '*.xml' | parallel -j2 script.sh {}

it works fine. Is there a difference between how ls and find are passing arguments to my script? Or do I need to do something additional in that script?

GNU parallel is a variant of xargs. They both have very similar interfaces, and if you're looking for help on parallel, you may have more luck looking up information about xargs.

That being said, the way they both operate is fairly simple. With their default behavior, both programs read input from STDIN, then break the input up into tokens (based on whitespace, in the case of xargs). Each of these tokens is then passed to a provided program as an argument. The default for xargs is to pass as many tokens as possible to the program, and then start a new process when the limit is hit. GNU parallel, by default, starts one job per input line.

Here is an example:

> echo "foo    bar \
  baz" | xargs echo
foo bar baz
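To make the batching visible, here is a minimal sketch assuming any POSIX xargs. With no options, every token ends up on a single echo command line; with -n 1, xargs starts one echo per token, which is closer in spirit to running one job per input item.

```shell
#!/usr/bin/env bash
# Default: xargs packs as many tokens as fit into one command line.
batched=$(printf 'foo bar\nbaz\n' | xargs echo)
echo "$batched"

# -n 1: run the command once per token instead of once per batch.
per_token=$(printf 'foo bar\nbaz\n' | xargs -n 1 echo)
echo "$per_token"
```

The first echo prints one line; the second prints each token on its own line, because echo ran three times.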

There are some problems with the default behavior, so it is common to see several variations.

The first issue is that because whitespace is used to tokenize, any files with white space in them will cause parallel and xargs to break. One solution is to tokenize around the NULL character instead. find even provides an option to make this easy to do:

> echo "Success!" > bad\ filename
> find . -name 'bad filename' -print0 | xargs -0 cat
Success!

The -print0 option tells find to separate files with the NULL character instead of whitespace.
The -0 option tells xargs to use the NULL character to tokenize each argument.
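The difference is easiest to see side by side. The following is a sketch that runs in a throwaway directory (the directory and the file name are made up for the demonstration): the whitespace-tokenized pipeline fails, while the NULL-delimited one reads the file.

```shell
#!/usr/bin/env bash
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1
echo "Success!" > "bad filename"

# Whitespace tokenizing: xargs hands cat "./bad" and "filename"; neither exists.
broken=$(find . -name 'bad*' | xargs cat 2>/dev/null)

# NULL tokenizing: the name survives as one argument.
fixed=$(find . -name 'bad*' -print0 | xargs -0 cat)

echo "broken='$broken' fixed='$fixed'"
cd / && rm -rf "$workdir"
```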

Note that parallel is a little better than xargs in that its default behavior is to tokenize around only newlines, so there is less of a need to change the default behavior.

Another common issue is that you may want to control how the arguments are passed to xargs or parallel. If you need to have a specific placement of the arguments passed to the program, you can use {} to specify where the argument is to be placed. parallel understands {} out of the box; xargs only substitutes it when you pass -I {}, which also makes it run the command once per input line.

> mkdir new_dir
> find . -name '*.xml' | xargs -I {} mv {} new_dir

This will move all XML files in the current directory and subdirectories into the new_dir directory. It actually breaks down into the following:

> find . -name '*.xml' | xargs -I {} echo mv {} new_dir
mv ./foo.xml new_dir
mv ./bar.xml new_dir
mv ./baz.xml new_dir
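As a runnable check of the placement behavior, here is a small sketch using xargs with -I {} in a temporary directory. The src and new_dir names and the sample files are invented for the example.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1
mkdir src new_dir
touch src/foo.xml src/bar.xml src/notes.txt

# -I {} runs mv once per input line, substituting that line for {}.
find src -name '*.xml' | xargs -I {} mv {} new_dir

moved=$(ls new_dir | sort | xargs)   # names that arrived in new_dir
left=$(ls src)                       # names that stayed behind
echo "moved: $moved; left behind: $left"
cd / && rm -rf "$workdir"
```

Only the two .xml files are moved; notes.txt never matches the -name test, so it is never passed to mv.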

So taking into consideration how xargs and parallel work, you should hopefully be able to see the issue with your command. find . -name '*.xml' will generate a list of xml files to be passed to the script.sh program.

> find . -name '*.xml' | parallel -j2 echo script.sh {}
script.sh ./foo.xml
script.sh ./bar.xml
script.sh ./baz.xml

However, ls | parallel -j2 script.sh {} will generate a list of ALL files in the current directory to be passed to the script.sh program.

> ls | parallel -j2 echo script.sh {}
script.sh some_directory
script.sh some_file
script.sh foo.xml
...

A more correct variant on the ls version would be as follows:

> ls *.xml | parallel -j2 script.sh {}

However, an important difference between this and the find version is that find will search through all subdirectories for files, while ls will only search the current directory. The equivalent find version of the above ls command would be as follows:

> find . -maxdepth 1 -name '*.xml'

This will only search the current directory.
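To compare what each variant actually emits, here is a sketch in a temporary directory with made-up file names. It assumes a find that supports -maxdepth (GNU and BSD find both do).

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1
mkdir sub
touch a.xml b.xml notes.txt sub/c.xml

# ls with a glob: bare names, current directory only.
from_ls=$(ls *.xml | LC_ALL=C sort | xargs)

# find limited to depth 1: same files, but each name carries a "./" prefix.
from_find=$(find . -maxdepth 1 -name '*.xml' | LC_ALL=C sort | xargs)

# find without a depth limit also descends into sub/.
recursive=$(find . -name '*.xml' | LC_ALL=C sort | xargs)

printf '%s\n' "ls:        $from_ls" "find -1:   $from_find" "find:      $recursive"
cd / && rm -rf "$workdir"
```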

Since it works with find you probably want to see what command GNU Parallel is running (using -v or --dryrun) and then try to run the failing commands manually.

ls *.xml | parallel --dryrun -j2 script.sh
find -maxdepth 1 -name '*.xml' | parallel --dryrun -j2 script.sh

I have not used parallel, but there is a difference between ls and find . -name '*.xml'. ls will list all the files and directories, whereas find . -name '*.xml' will list only the files (and directories) whose names end with .xml.
As suggested by Paul Rubel, just print the value of $1 in your script to check this. Additionally you may want to consider filtering the input to files only in find with the -type f option.
Hope this helps!
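Along those lines, here is a hypothetical rewrite of script.sh that prints the argument it received, refuses anything that is not a regular file, and calls curl directly with a quoted argument instead of building a string for eval. The post_file function name and the DRY_RUN variable are assumptions added for this sketch; the URL is the placeholder from the question.

```shell
#!/usr/bin/env bash
# Hypothetical variant of script.sh. DRY_RUN=1 prints the curl command
# instead of sending anything (an addition for this sketch).

post_file() {
    local file=$1
    echo "argument received: '$file'" >&2
    if [ ! -f "$file" ]; then
        echo "skipping '$file': not a regular file" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf 'curl -X POST -d @%s http://server/path\n' "$file"
    else
        # Quoting "@$file" keeps names with spaces intact; no eval needed.
        curl -X POST -d "@$file" http://server/path
    fi
}

# Only act when an argument was actually supplied.
if [ "$#" -gt 0 ]; then
    post_file "$1"
fi
```

Run through ls | parallel, a script like this would report directories and other non-file entries on stderr rather than passing them on to curl.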

Neat.

I had never used parallel before. It appears, though, that there are two of them. One is GNU Parallel, and the one that was installed on my system has Tollef Fog Heen listed as the author in the man pages.

As Paul mentioned, you should use set -x

Also, the paradigm that you mentioned above doesn't seem to work on my parallel, rather, I have to do the following:

$ cat ../script.sh
+ cat ../script.sh
#!/bin/bash
echo $@
$ parallel -ij2 ../script.sh {} -- $(find -name '*.xml')
++ find -name '*.xml'
+ parallel -ij2 ../script.sh '{}' -- ./b.xml ./c.xml ./a.xml ./d.xml ./e.xml
./c.xml
./b.xml
./d.xml
./a.xml
./e.xml
$ parallel -ij2 ../script.sh {} -- $(ls *.xml)
++ ls --color=auto a.xml b.xml c.xml d.xml e.xml
+ parallel -ij2 ../script.sh '{}' -- a.xml b.xml c.xml d.xml e.xml
b.xml
a.xml
d.xml
c.xml
e.xml

find does provide different input: it prepends the relative path to the name. Maybe that is what is messing up your script?
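If the leading ./ does turn out to matter (that is only a guess), the script can normalize its argument with a parameter expansion. The strip_dot_slash helper below is a made-up name for illustration.

```shell
#!/usr/bin/env bash
# Remove one leading "./" from a path, if present; leave other paths alone.
strip_dot_slash() {
    printf '%s\n' "${1#./}"
}

strip_dot_slash ./a.xml       # name as produced by find .
strip_dot_slash a.xml         # name as produced by ls
strip_dot_slash ./sub/c.xml   # nested path keeps its directory part
```

With GNU Parallel, the {/} replacement string is another option: it passes only the basename of each input line, though that also drops any subdirectory part.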
