How does data get processed across pipes?
I am using this command line, which I found in another SO post describing how to spider a website:
wget --spider --force-html -r -l2 http://example.com 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' > wget.out
When I crawl a large site, it takes a long time to finish, and meanwhile the wget.out file on disk stays at zero size. So when does the piped data get processed and written to the file on disk? Is it only after each stage in the pipe has run to completion? In that case, will wget.out only fill up after the entire crawl is over?
How do I make the pipeline write to disk intermittently, so that even if the crawling stage is interrupted, I still have some output saved?
There is buffering in each pipe, and maybe in the stdio layers of each program. Data will not make it to the disk until the final grep has processed enough lines to cause its buffers to fill to the point of being spilled to disk.
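If the goal is simply to watch wget.out grow while the crawl is still running, another option (not part of the answer above, and assuming GNU grep plus an awk that supports fflush(), such as gawk or mawk) is to ask each stage to flush its output after every line:

    wget --spider --force-html -r -l2 http://example.com 2>&1 |
        grep --line-buffered '^--' |
        awk '{ print $3; fflush() }' |
        grep --line-buffered -v '\.\(css\|js\|png\|gif\|jpg\)$' > wget.out

GNU coreutils also provides stdbuf, so prefixing a stage with stdbuf -oL can often get the same effect for filters that lack a flushing option of their own.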
If you run your pipeline on the command line and then hit Ctrl-C, SIGINT will be sent to every process, terminating each and losing any pending output.
Either:

1. Ignore SIGINT in all processes but the first. Bash hackery follows:

       $ wget --spider --force-html -r -l2 http://example.com 2>&1 |
             grep '^--' | { trap '' INT; awk '{ print $3 }'; } | ...

2. Simply deliver the keyboard interrupt to the first process. Interactively you can discover the pid with jobs -l and then kill that. (Run the pipeline in the background; a fuller end-to-end sketch follows this list.)

       $ jobs -l
       [1]+ 10864 Running    wget ...
             3364 Running    | grep ...
            13500 Running    | awk ...
             ...
       $ kill -INT 10864

3. Play around with the disown bash builtin.
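Putting option 2 together end to end might look like the following, a minimal sketch rather than anything from the answer above; the pid 10864 is illustrative, so substitute whatever jobs -l actually reports for the wget stage:

    $ wget --spider --force-html -r -l2 http://example.com 2>&1 |
          grep '^--' | awk '{ print $3 }' |
          grep -v '\.\(css\|js\|png\|gif\|jpg\)$' > wget.out &   # run in the background
    $ jobs -l              # note the pid listed for the wget stage
    $ kill -INT 10864      # interrupt only wget (illustrative pid; use yours)
    $ wait                 # downstream stages see end-of-input, flush, and exit,
                           # leaving the partial results in wget.out

Because only wget receives the interrupt, the rest of the pipeline shuts down normally and its buffered output is flushed into wget.out.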