perl sort question
I have some huge log files I need to sort. All entries have a 32-bit hex number which is the sort key I want to use. Some entries are one-liners like
bla bla bla 0x97860afa bla bla
others are a bit more complex: they start with the same type of line as above and expand to a block of lines delimited by curly brackets, like the example below. In this case the entire block has to move to the position defined by the hex number. Block example:
bla bla bla 0x97860afc bla bla
bla bla {
blabla
bla bla {
bla
}
}
I can probably figure it out but maybe there is a simple perl or awk solution that will save me 1/2 day.
Transferring comments from OP:
Indentation can be space or tab; I can adapt that in any proposed solution. I think Brian summarizes it well: specifically, the "items" to sort are chunks of text that start with a line containing a "0xNNNNNNNN" and contain everything up to (but not including) the next line containing a "0xNNNNNNNN" (where the N's change, of course). No lines interspersed.
Something like this might work (Not tested):
my $line;
my $lastkey;
my %data;
while ($line = <>) {
    chomp $line;
    if ($line =~ /\b(0x\p{AHex}{8})\b/) {
        # Begin a new entry
        my $unique_key = $1 . $.;   # append the line number; cred to Brian Gerard for uniqueness
        $data{$unique_key} = $line;
        $lastkey = $unique_key;
    } else {
        # Continue an old entry
        $data{$lastkey} .= $line;
    }
}
# Sort on the hex value first, then on the appended line number for duplicate keys
print $data{$_}, "\n"
    for sort {
           hex(substr($a, 0, 10)) <=> hex(substr($b, 0, 10))
        || substr($a, 10) <=> substr($b, 10)
    } keys %data;
The problem is that you said "huge" log files, so storing the file in memory will probably be inefficient. However, if you want to sort it, I suspect you're going to need to do that.
If storing in memory is not an option, you can always just print the data to a file instead, with a format that will allow you to sort it by some other means.
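For what it's worth, a minimal sketch of that idea (untested; the output name collapsed.txt and the exact collapsing of each record onto one line are my assumptions, mirroring the in-memory version above):

# Sketch: write each record as one line, prefixed with a zero-padded decimal
# key, so that an external sort(1) can do the work instead of %data.
open my $out, '>', 'collapsed.txt' or die "open: $!";
while (my $line = <>) {
    chomp $line;
    if ($line =~ /\b(0x\p{AHex}{8})\b/) {
        print {$out} "\n" if $. > 1;        # terminate the previous record (assumes line 1 starts a record)
        printf {$out} "%010u\t", hex($1);   # zero-padded numeric key, so a plain lexical sort is enough
    }
    print {$out} $line;                     # continuation lines are appended to the same record
}
print {$out} "\n";
close $out or die "close: $!";
# afterwards:  sort collapsed.txt | cut -f2-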
- For huge data files, I'd recommend Sort::External.
- It doesn't look like you need to parse the brackets if the indentation does the job. You then work on "breaks": when the indentation level returns to 0, you process the last record gathered, so you always look ahead one line.
So:
use Sort::External;

sub to_sort_form {
    my $buffer = $_[0];
    my ( $id ) = $buffer =~ m/(0x\p{AHex}{8})/;  # grab the first candidate
    $_[0] = '';                                  # reset the caller's buffer (aliased through @_)
    return "$id-:-$buffer";
}

sub to_source {
    my $val = shift;
    my ( $record ) = $val =~ m/-:-(.*)/;
    $record =~ s/\$--\^/\n/g;                    # restore the line breaks
    return $record;
}

my $sortex = Sort::External->new(
      mem_threshold => 1024**2 * 16    # default: 1024**2 * 8 (8 MiB)
    , cache_size    => 100_000         # default: undef (disabled)
    , sortsub       => sub { $Sort::External::a cmp $Sort::External::b }
    , working_dir   => $temp_directory # a scratch directory you supply
);

my $buffer = <>;
chomp $buffer;
$buffer .= '$--^';
while ( <> ) {
    my ( $indent ) = m/^(\s*)\S/;
    unless ( length $indent ) {                  # new top-level line: flush the previous record
        $sortex->feed( to_sort_form( $buffer ));
    }
    chomp;
    $buffer .= $_ . '$--^';
}
$sortex->feed( to_sort_form( $buffer ));         # don't forget the last record
$sortex->finish;

while ( defined( $_ = $sortex->fetch ) ) {
    print to_source( $_ );
}
Assumptions:
- The string '$--^' does not appear in the data on its own.
- You're not alarmed about two 8-hex-digit strings in one record.
If the files are not too big for memory, I would go with TLP's solution. If they are, you can modify it just a bit and print to a file as he suggests. Add this before the while loop (all untested, YMMV, caveat programmer, etc.):
my $currentInFile = "";
my $currentOutFileHandle;
And change the body of the while loop from the current if-else to:
if ($currentInFile ne $ARGV) {
    if ($currentOutFileHandle && fileno($currentOutFileHandle)) {
        if (!close($currentOutFileHandle)) {
            # whatever you want to do if you can't close the previous output file
        }
    }
    my $newOutFile = $ARGV . ".tagged";
    if (!open($currentOutFileHandle, ">", $newOutFile)) {
        # whatever you want to do if you can't open a new output file for writing
    }
    $currentInFile = $ARGV;
}
if (...conditional from TLP...) {
    $lastkey = $1;   # remember the current hex key
}
if ($currentOutFileHandle && fileno($currentOutFileHandle)) {
    # tag every line with the current hex key plus its own zero-padded line number
    # (add more zeroes if the files really are that large :)
    print $currentOutFileHandle $lastkey . " " . sprintf("%0.10d", $.) . "\t" . $line . "\n";
}
else {
    # whatever you want to do if $currentOutFileHandle's gone screwy
}
Now you'll have a foo.log.tagged for each foo.log you fed it; the .tagged file contains exactly the contents of the original, but with "0xNNNNNNNN LLLLLLLLLL\t" (LLLLLLLLLL being the zero-padded line number) prepended to each line. sort(1) actually does a pretty good job of handling large data, though you'll want to look at the --temporary-directory argument if you think it will overflow /tmp with its temp files while chewing through the stuff you feed it. Something like this should get you started:
sort --output=/my/new/really.big.file --temporary-directory=/scratch/dir/on/roomy/partition *.tagged
Then trim away the tags if desired:
perl -pi -e 's/^[^\t]+\t//' /my/new/really.big.file
FWIW, I padded the line numbers to keep from having to worry about such things as line 10 sorting before line 2 if their hex keys were identical - since the hex numbers are the primary sort criterion, we can't just sort numerically.
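To make that concrete, here is a tiny illustration (the key and line numbers are made up) of how the padding changes the lexical order:

# Without padding, lexical comparison puts line 10 ahead of line 2
# for the same hex key:
my @unpadded = sort ("0x97860afa 10\tsecond", "0x97860afa 2\tfirst");
# -> "0x97860afa 10\tsecond" sorts first, which is the wrong order.

# With zero padding the original line order survives:
my @padded = sort ("0x97860afa 0000000010\tsecond", "0x97860afa 0000000002\tfirst");
# -> "0x97860afa 0000000002\tfirst" sorts first, as intended.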
One way (untested):
perl -wne'BEGIN{ $key = " " x 10 }' \
-e '$key = $1 if /(0x[0-9a-f]{8})/;' \
-e 'printf "%s%.10d%s", $key, $., $_' \
inputfile \
| LC_ALL=C sort \
| perl -wpe'substr($_,0,20,"")'
The solution from TLP worked nicely with some minor tweaks. Collapsing each entry onto one line before sorting was a good idea; next I have to add a post-parsing pass to restore the code blocks that got collapsed, but that is easy (a rough sketch of such a pass follows the script below). Below is the final tested version. Thank you all, stackoverflow is awesome.
#!/usr/bin/perl -w
my $line;
my $lastkey;
my %data;
while ($line = <>) {
    chomp $line;
    if ($line =~ /\b(0x\p{AHex}{8})\b/) {
        # Begin a new entry
        #my $unique_key = $1 . $.; # cred to Brian Gerard for uniqueness
        my $unique_key = hex($1);
        $data{$unique_key} = $line;
        $lastkey = $unique_key;
    } else {
        # Continue an old entry
        $data{$lastkey} .= $line;
    }
}
print $data{$_}, "\n" for (sort { $a <=> $b } keys %data);
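For the post-parsing the OP mentions, something along these lines might work to re-expand a collapsed record from its curly brackets alone (my own untested sketch, not part of the accepted code; expand_block is a made-up helper name):

# Untested sketch: split a collapsed record at every '{' and '}' and
# re-indent by brace depth. Line breaks that did not sit next to a brace
# cannot be recovered, since the collapse dropped them.
sub expand_block {
    my ($collapsed) = @_;
    my @out;
    my $depth = 0;
    for my $piece ( split /(?<=\{)|(?=\})/, $collapsed ) {
        $depth-- if $piece =~ /^\}/;       # a closing brace ends a level
        push @out, ("    " x $depth) . $piece;
        $depth++ if $piece =~ /\{\s*$/;    # an opening brace starts a level
    }
    return join("\n", @out) . "\n";
}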