Better way to sort by frequency in 2D array?
I have this method of sorting which is basically just basic thought processes, not using Perl power, and once in a while it doesn't act how I want it (misses some frequency counting). I was wondering if there was a better way to sort this.
Objective Sort the array based on frequency of matches found.
Sample array of arrays
##ADDED 1 to END of EACH ROW, just because my sort forced me too!!!
my @all_matches = (["chpt10_2", "sent. 2", "alice", "nsubj", "animals", "protect"],
["chpt12_1", "sent. 54", "bob", "nsubj", "cells", "protect"],
["chpt25_4", "sent. 47", "carol", "nsubj", "plants", "protect"],
["chpt34_1", "sent. 1", "dave", "nsubj", "cells", "protect"],
["chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect"],
["chpt38_1", "sent. 1", "fred", "nsubj", "animals", "protect"],
["chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect"]
);
Current sort
@all_matches = sort {lc($a->[4]) cmp lc($b->[4])} @all_matches;
my ($last_word, $current_word, $word_count);
for my $j (0 .. $#all_matches) {
$current_word = $all_matches[$j][4];
if (lc($last_word) eq lc($current_word)) {
$word_count++;
}
else {
if ($j != 0)
{
for (my $k = 1; $k <= $word_count; $k++)
{
$all_matches[($j-$k)][6] = $word_count;
}
}
$last_word = $current_word;
$word_count = 1;
}
}
@all_matches = sort {$b->[6] <=> $a->[6] || lc($a->[4]) cmp lc($b->[4])} @all_matches;
Problem The 6th column is set to 1 when all_matches is passed in!!! The reason this was done was because sometimes, the count ($match->[6]
) was blank.
Bonus? Match frequency of times the last two columns appear together (right now I'm pretty sure 开发者_开发技巧it just checks 2nd last column). In this test case, the final column is all the same, in the actual case, there are different suffixes on the end (ie. protect, protects, protective etc..)
THANKS a lot for your time. I've tried using a hash, and thought it worked, however it neglected some things.
Here was my hash attempt. Couldn't tell you yet why this didn't work:
my %freq;
foreach ( map{$_->[4]}@results) #feeds in list of animals, cells, uticle, etc.
{
$freq{lc $_}++;
}
@results = sort {$freq{lc $b->[4]} <=> $freq{lc $a->[4]} #freq order
or
$a->[0] cmp $b->[0] #text col 0
} @results;
Why not create a hash of the keys with a count of the occurrences, and use that:
my %counts;
foreach my $rowref (@all_matches)
{
$counts{lc($rowref->[4])}++;
}
@all_matches = sort { $counts{lc($b->[4])} <=> $counts{lc($a->[4])} ||
lc($a->[4]) cmp lc($b->[4])
} @all_matches;
Tested...
#!/usr/bin/env perl
use strict;
use warnings;
my @all_matches = (
["chpt10_2", "sent. 2", "alice", "nsubj", "animals", "protect"],
["chpt12_1", "sent. 54", "bob", "nsubj", "cells", "protect"],
["chpt25_4", "sent. 47", "carol", "nsubj", "plants", "protect"],
["chpt34_1", "sent. 1", "dave", "nsubj", "cells", "protect"],
["chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect"],
["chpt38_1", "sent. 1", "fred", "nsubj", "animals", "protect"],
["chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect"]
);
my %counts;
foreach my $rowref (@all_matches)
{
$counts{lc($rowref->[4])}++;
}
@all_matches = sort { $counts{lc($b->[4])} <=> $counts{lc($a->[4])} ||
lc($a->[4]) cmp lc($b->[4])
} @all_matches;
my $i = 0;
foreach my $rowref (@all_matches)
{
$i++;
print "$i";
print " $_" foreach (@$rowref);
print "\n";
}
Output:
1 chpt12_1 sent. 54 bob nsubj cells protect
2 chpt34_1 sent. 1 dave nsubj cells protect
3 chpt35_1 sent. 2 eli nsubj cells protect
4 chpt10_2 sent. 2 alice nsubj animals protect
5 chpt38_1 sent. 1 fred nsubj animals protect
6 chpt25_4 sent. 47 carol nsubj plants protect
7 chpt54_1 sent. 1 greg nsubj uticle protect
As noted in a comment, given the data shown, the lc
operations are not needed - and removing them would improve performance, as would adding a case-converted key to each array.
And with lc
used once per row - notice the munged data values:
#!/usr/bin/env perl
use strict;
use warnings;
my @all_matches = (
[ "chpt10_2", "sent. 2", "alice", "nsubj", "animAls", "protect" ],
[ "chpt12_1", "sent. 54", "bob", "nsubj", "celLs", "protect" ],
[ "chpt25_4", "sent. 47", "carol", "nsubj", "plAnts", "protect" ],
[ "chpt34_1", "sent. 1", "dave", "nsubj", "cElls", "protect" ],
[ "chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect" ],
[ "chpt38_1", "sent. 1", "fred", "nsubj", "Animals", "protect" ],
[ "chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect" ],
);
my %counts;
foreach my $rowref (@all_matches)
{
push @$rowref, lc($rowref->[4]);
$counts{$rowref->[6]}++;
}
@all_matches = sort { $counts{$b->[6]} <=> $counts{$a->[6]} || $a->[6] cmp $b->[6]
} @all_matches;
my $i = 0;
foreach my $rowref (@all_matches)
{
$i++;
print "$i";
printf " %-9s", $_ foreach (@$rowref);
print "\n";
}
Output:
1 chpt12_1 sent. 54 bob nsubj celLs protect cells
2 chpt34_1 sent. 1 dave nsubj cElls protect cells
3 chpt35_1 sent. 2 eli nsubj cells protect cells
4 chpt10_2 sent. 2 alice nsubj animAls protect animals
5 chpt38_1 sent. 1 fred nsubj Animals protect animals
6 chpt25_4 sent. 47 carol nsubj plAnts protect plants
7 chpt54_1 sent. 1 greg nsubj uticle protect uticle
Try this one:
my @all_matches = (["chpt10_2", "sent. 2", "alice", "nsubj", "animals", "protect"],
["chpt12_1", "sent. 54", "bob", "nsubj", "cells", "protect"],
["chpt25_4", "sent. 47", "carol", "nsubj", "plants", "protect"],
["chpt34_1", "sent. 1", "dave", "nsubj", "cells", "protect"],
["chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect"],
["chpt38_1", "sent. 1", "fred", "nsubj", "animals", "protect"],
["chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect"]
);
my %wordcount;
foreach my $row (@all_matches) {
$wordcount{$row->[4]}++;
}
my @sorted = sort { $wordcount{$b->[4]} <=> $wordcount{$a->[4]} } @all_matches;
精彩评论