Counting total sum of each value in one column w.r.t another in Perl
I have tab delimited data with multiple columns.
I have OS names in column 31 and data bytes in columns 6 and 7. What I want to do is count the total volume of each unique OS.
So, I did something in Perl like this:
#!/usr/bin/perl
use warnings;
my @hhfilelist = glob "*.txt";
my %count = ();
for my $f (@hhfilelist) {
open F, $f || die "Cannot open $f: $!";
while (<F>) {
chomp;
my @line = split /\t/;
# counting volumes in col 6 and 7 for 31
$count{$line[30]} = $line[5] + $line[6];
}
close (F);
}
my $w = 0;
foreach $w (sort keys %count) {
print "$w\t$count{$w}\n";
}
So, the result would be something like
Windows 100000
Linux 5000
Mac OSX 15000
Android 2000
But there seems to be some error in this code because the r开发者_StackOverflow中文版esulting values I get aren't as expected.
What am I doing wrong?
It seems like you're not actually ADDING up the counts - you overwrite the last count for any OS with the count from the last line for that OS.
$count{$line[30]} = $line[5] + $line[6];
Should be
$count{$line[30]} += $line[5] + $line[6];
As additional considerations that can improve your code overall but don't affect the correctness of it:
Please use 3-argument form of open and Lexical filehandles:
open(my $filehandle, "<", $f) || die "Cannot open $f: $!";
If you're 100% sure you file doesn't contain quoted field values or tabs in the field contents, your
split
based logic is OK. For really complicated X-separated files, I would strongly recommend usingText::CSV_XS
/Text::CSV
CPAN moduleDon't need to initialize
%count
or$w
variables - hash will get autoinitialized to empty hash, and$w
gets assigned to as loop variable - you may want to actually declare it in the loop itself:foreach my $w (sort keys %count) {
Please don't use 1-letter variables.
$w
is meaningless in the last loop, whereas$os_name
is clear.
Your expression
open F, $f || die "Cannot open $f: $!";
has a subtle bug in it that will eventually bite you, though probably not today.
The ||
operator has higher precedence than the comma-operator to its left and so this expression actually gets parsed as
open F, ($f || die "Cannot open $f: $!")
which is to say, you will die
when $f
has a false (0, ""
, or undef
) value, not when the open
statement fails to open the file with the name given by $f
.
To do what you mean, you could either use parentheses:
open (F, $f) || die ...
or use the alternate low-precedence or
operator
open F, $f or die ...
(At times I have been bitten by this myself)
$count{$line[30]} = $line[5] + $line[6];
should use the += operator to add the row's sum to the total, rather than set it as the total:
$count{$line[30]} += $line[5] + $line[6];
精彩评论