Bash GNU Parallel help
It's about GNU Parallel (http://en.wikipedia.org/wiki/Parallel_(software)) and its very rich man page (http://www.gnu.org/software/parallel/man.html). According to the man page, this:
(for x in `cat list` ; do
  do_something $x
done) | process_output
is replaced by this:
cat list | parallel do_something | process_output
I am trying to apply that to this loop:
while [ "$n" -gt 0 ]
do
  percentage=$(echo "scale=2;(100-(($n / $end) * 100))" | bc -l)
  # get url from line specified by n from file done1
  nextUrls=`sed -n "${n}p" < done1`
  echo -ne "${percentage}% $n / $end urls saved, going to line 1. current: $nextUrls\r"
  # function that gets links from the url
  getlinks $nextUrls
  # save n
  echo $n > currentLine
  let "n--"
  let "end=`cat done1 |wc -l`"
done
While reading the documentation for GNU Parallel I found out that functions are not supported, so getlinks won't work when called from parallel.
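One thing that might work around that (this is only a sketch I have not tested, and it assumes bash is the shell parallel ends up using to run the jobs) would be exporting the function so the shells parallel spawns can see it:

export -f getlinks
cat done1 | parallel getlinks

but I do not know whether that actually works with my version of parallel.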
The best I have found so far is:
seq 30 | parallel -n 4 --colsep ' ' echo {1} {2} {3} {4}
which produces this output:
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
17 18 19 20
21 22 23 24
25 26 27 28
29 30
The while loop mentioned above should go something like this, if I am right:
end=`cat done1 |wc -l`
seq $end -1 1 | parallel -j+4 -k
#(everything except the getlinks function goes here, but I don't know how?) |
# every time it finishes do
getlinks $nextUrls
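Roughly what I have in mind, as an untested sketch: move the loop body into a helper script (process_line.sh is just a name I made up) and let parallel feed it the line numbers:

#!/bin/bash
# process_line.sh (made-up name) - holds the old loop body for one line number
n=$1
end=`cat done1 | wc -l`
# get url from line specified by n from file done1
nextUrls=`sed -n "${n}p" < done1`
# getlinks would also have to be a separate script or an exported function to be visible here
getlinks "$nextUrls"

which would then be driven by something like:

chmod +x process_line.sh
end=`cat done1 | wc -l`
seq $end -1 1 | parallel -k -j+4 ./process_line.sh {}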
Thanks in advance for any help.
It seems what you want is a progress meter. Try:
cat done1 | parallel --eta wget
If that is not what you want, look at sem (sem is an alias for parallel --semaphore and is normally installed with GNU Parallel):
for i in `ls *.log` ; do
  echo $i
  sem -j+0 gzip $i ";" echo done
done
sem --wait
In your case it will be something like:
while [ "$n" -gt 0 ]
do
  percentage=$(echo "scale=2;(100-(($n / $end) * 100))" | bc -l)
  # get url from line specified by n from file done1
  nextUrls=`sed -n "${n}p" < done1`
  echo -ne "${percentage}% $n / $end urls saved, going to line 1. current: $nextUrls\r"
  # function that gets links from the url
  THE_URL=`getlinks $nextUrls`
  sem -j10 wget $THE_URL
  # save n
  echo $n > currentLine
  let "n--"
  let "end=`cat done1 |wc -l`"
done
sem --wait
echo All done
Why does getlinks need to be a function? Take the function and turn it into a shell script. It should be essentially identical, except that you need to export environment variables into it, and of course you cannot affect the outside environment without a lot of work.
Of course, you cannot save $n into currentLine while you are executing in parallel: all the parallel jobs would be overwriting the file at the same time.
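A rough sketch of that transformation (getlinks.sh is just an assumed file name, and the one-line body is only a stand-in for whatever your real function does):

#!/bin/bash
# getlinks.sh - standalone replacement for the getlinks function
# the URL arrives as $1; any variable it needs from the caller must be exported there
lynx -image_links -dump "$1" | grep -i http >> done1

called from the main script roughly like this:

chmod +x getlinks.sh
export end                      # example: export anything the child script would read
sem -j10 ./getlinks.sh "$nextUrls"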
I was thinking of making something more like the code below (with something other than parallel or sem if needed, since parallel does not support functions, see http://www.gnu.org/software/parallel/man.html#aliases_and_functions_do_not_work):
getlinks(){
  if [ -n "$1" ]
  then
    lynx -image_links -dump "$1" > src
    grep -i ".jpg" < src > links1
    grep -i "http" < links1 > links
    sed -e 's/.*\(http\)/http/g' < links >> done1
    sort -f done1 > done2
    uniq done2 > done1
    rm -rf links1 links src done2
  fi
}
func(){
  n=$1
  percentage=$(echo "scale=2;(100-(($n / $end) * 100))" | bc -l)
  # get url from line specified by n from file done1
  nextUrls=`sed -n "${n}p" < done1`
  echo -ne "${percentage}% $n / $end urls saved, going to line 1. current: $nextUrls\r"
  # function that gets links from the url
  getlinks $nextUrls
  # save n
  echo $n > currentLine
  let "end=`cat done1 |wc -l`"
}
while [ "$n" -gt 0 ]
do
  sem -j10 func $n
  let "n--"
done
sem --wait
echo All done
My script has become really complex, and I do not want to make a feature unavailable by switching to something I am not sure can be done. This way I can fetch links with the full internet bandwidth in use, which should take less time.
I tried sem:
#!/bin/bash
func (){
  echo 1
  echo 2
}
for i in `seq 10`
do
  sem -j10 func
done
sem --wait
echo All done
and I get these errors:
Can't exec "func": No such file or directory at /usr/share/perl/5.10/IPC/Open3.pm line 168.
open3: exec of func failed at /usr/local/bin/sem line 3168
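I guess that means sem can only exec a real program, not a shell function, so the body would have to live in its own file. An untested sketch (func.sh is just a name I made up):

#!/bin/bash
# func.sh - same body as func() above, but as an executable script
echo 1
echo 2

and then:

chmod +x func.sh
for i in `seq 10`
do
  sem -j10 ./func.sh
done
sem --wait
echo All done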
It is not quite clear what the end goal of your script is. If you are trying to write a parallel web crawler, you might be able to use the below as a template.
#!/bin/bash
# E.g. http://gatt.org.yeslab.org/
URL=$1
# Stay inside the start dir
BASEURL=$(echo $URL | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:')
URLLIST=$(mktemp urllist.XXXX)
URLLIST2=$(mktemp urllist.XXXX)
SEEN=$(mktemp seen.XXXX)
# Spider to get the URLs
echo $URL >$URLLIST
cp $URLLIST $SEEN
while [ -s $URLLIST ] ; do
  cat $URLLIST |
    parallel lynx -listonly -image_links -dump {} \; wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 |
    perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and do { $seen{$1}++ or print }' |
    grep -F $BASEURL |
    grep -v -x -F -f $SEEN | tee -a $SEEN > $URLLIST2
  mv $URLLIST2 $URLLIST
done
rm -f $URLLIST $URLLIST2 $SEEN
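Assuming the template above is saved as, say, spider.sh, it would be run like this (the URL is the example from the comment at the top of the script):

chmod +x spider.sh
./spider.sh http://gatt.org.yeslab.org/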