Splitting gzipped logfiles without storing the ungzipped splits on disk
I have a recurring task of splitting a set of large (about 1-2 GiB each) gzipped Apache logfiles into several parts (say chunks of 500K lines). The final files should be gzipped again to limit the disk usage.
On Linux I would normally do:
zcat biglogfile.gz | split -l500000
The resulting files will be named xaa, xab, xac, and so on. So I then do:
gzip x*
The downside of this method is that the huge uncompressed intermediate files are temporarily stored on disk. Is there a way to avoid this intermediate disk usage?
Can I (in a way similar to what xargs does) have split pipe the output through a command (like gzip) and recompress the output on the fly? Or am I looking in the wrong direction and is there a much better way to do this?
Thanks.
You can use the split --filter option, as explained in the manual, e.g.:
zcat biglogfile.gz | split -l500000 --filter='gzip > $FILE.gz'
Edit: I'm not sure when the --filter option was introduced, but according to the comments it does not work in coreutils 8.4.
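For reference, a sketch of the same command with an explicit output prefix (chunk_ is a hypothetical name; this still assumes a coreutils version that supports --filter). The pieces come out as chunk_aa.gz, chunk_ab.gz, and so on:
zcat biglogfile.gz | split -l500000 --filter='gzip > $FILE.gz' - chunk_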
A script like the following might suffice.
#!/usr/bin/perl
use strict;
use warnings;
use PerlIO::gzip;

my $filename = 'out';
my $limit    = 500000;
my $fileno   = 1;
my $line     = 0;
my $fh;

while (<>) {
    # Open a new gzip-compressed output file when none is open yet
    # or the current one has reached the line limit.
    if (!$fh || $line >= $limit) {
        open $fh, '>:gzip', "${filename}_$fileno"
            or die "Cannot open ${filename}_$fileno: $!";
        $fileno++;
        $line = 0;
    }
    print $fh $_;
    $line++;
}
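To use it, pipe the uncompressed stream in; a sketch assuming the script is saved under the hypothetical name gzsplit.pl and made executable:
zcat biglogfile.gz | ./gzsplit.pl
Each output file (out_1, out_2, ...) is a gzip stream, even though the script as written does not add a .gz suffix.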
In case people need to keep the first row (the header) in each of the pieces:
zcat bigfile.csv.gz | tail -n +2 | split -l1000000 --filter='{ { zcat bigfile.csv.gz | head -n 1 | gzip; gzip; } > $FILE.gz; };'
I know it's a bit clunky. I'm looking for a more elegant solution.
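For readability, here is a sketch of the same filter spread over several lines with comments (the behavior is unchanged); it works because concatenated gzip streams decompress as a single file:
zcat bigfile.csv.gz | tail -n +2 | split -l1000000 --filter='
  {
    zcat bigfile.csv.gz | head -n 1 | gzip   # re-emit the compressed header line
    gzip                                     # compress the body of this chunk from stdin
  } > $FILE.gz'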
There's also zipsplit, but that produces zip archives rather than gzip files.