开发者

Better way to sort by frequency in 2D array?

I have this method of sorting which is basically just basic thought processes, not using Perl power, and once in a while it doesn't act how I want it (misses some frequency counting). I was wondering if there was a better way to sort this.

Objective Sort the array based on frequency of matches found.

Sample array of arrays

##ADDED 1 to END of EACH ROW, just because my sort forced me too!!!
my @all_matches = (["chpt10_2", "sent. 2", "alice", "nsubj", "animals", "protect"],
               ["chpt12_1", "sent. 54", "bob", "nsubj", "cells", "protect"],
               ["chpt25_4", "sent. 47", "carol", "nsubj", "plants", "protect"],
               ["chpt34_1", "sent. 1", "dave", "nsubj", "cells", "protect"],
               ["chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect"],
               ["chpt38_1", "sent. 1", "fred", "nsubj", "animals", "protect"],
               ["chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect"]
              );

Current sort

@all_matches = sort {lc($a->[4]) cmp lc($b->[4])} @all_matches;

my ($last_word, $current_word, $word_count);

for my $j (0 .. $#all_matches) {

    $current_word = $all_matches[$j][4];

    if (lc($last_word) eq lc($current_word)) {
        $word_count++;
        }
    else {
        if ($j != 0)
        {
            for (my $k = 1; $k <= $word_count; $k++)
            {
               $all_matches[($j-$k)][6] = $word_count; 
            }
        }
        $last_word = $current_word;
        $word_count = 1;
        }
}
@all_matches = sort {$b->[6] <=> $a->[6] || lc($a->[4]) cmp lc($b->[4])} @all_matches;

Problem The 6th column is set to 1 when all_matches is passed in!!! The reason this was done was because sometimes, the count ($match->[6]) was blank.

Bonus? Match frequency of times the last two columns appear together (right now I'm pretty sure 开发者_开发技巧it just checks 2nd last column). In this test case, the final column is all the same, in the actual case, there are different suffixes on the end (ie. protect, protects, protective etc..)

THANKS a lot for your time. I've tried using a hash, and thought it worked, however it neglected some things.

Here was my hash attempt. Couldn't tell you yet why this didn't work:

my %freq;
foreach ( map{$_->[4]}@results) #feeds in list of animals, cells, uticle, etc.
{
   $freq{lc $_}++;
}


@results = sort {$freq{lc $b->[4]} <=> $freq{lc $a->[4]} #freq order
                                   or
                         $a->[0]  cmp $b->[0]            #text col 0      
                } @results; 


Why not create a hash of the keys with a count of the occurrences, and use that:

my %counts;
foreach my $rowref (@all_matches)
{
     $counts{lc($rowref->[4])}++;
}

@all_matches = sort { $counts{lc($b->[4])} <=> $counts{lc($a->[4])} ||
                      lc($a->[4]) cmp lc($b->[4])
                    } @all_matches;

Tested...

#!/usr/bin/env perl

use strict;
use warnings;

my @all_matches = (
    ["chpt10_2", "sent. 2", "alice", "nsubj", "animals", "protect"],
    ["chpt12_1", "sent. 54", "bob", "nsubj", "cells", "protect"],
    ["chpt25_4", "sent. 47", "carol", "nsubj", "plants", "protect"],
    ["chpt34_1", "sent. 1", "dave", "nsubj", "cells", "protect"],
    ["chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect"],
    ["chpt38_1", "sent. 1", "fred", "nsubj", "animals", "protect"],
    ["chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect"]
    );

my %counts;
foreach my $rowref (@all_matches)
{
    $counts{lc($rowref->[4])}++;
}

@all_matches = sort { $counts{lc($b->[4])} <=> $counts{lc($a->[4])} ||
                      lc($a->[4]) cmp lc($b->[4])
                    } @all_matches;

my $i = 0;
foreach my $rowref (@all_matches)
{
    $i++;
    print "$i";
    print " $_" foreach (@$rowref);
    print "\n";
}

Output:

1 chpt12_1 sent. 54 bob nsubj cells protect
2 chpt34_1 sent. 1 dave nsubj cells protect
3 chpt35_1 sent. 2 eli nsubj cells protect
4 chpt10_2 sent. 2 alice nsubj animals protect
5 chpt38_1 sent. 1 fred nsubj animals protect
6 chpt25_4 sent. 47 carol nsubj plants protect
7 chpt54_1 sent. 1 greg nsubj uticle protect

As noted in a comment, given the data shown, the lc operations are not needed - and removing them would improve performance, as would adding a case-converted key to each array.

And with lc used once per row - notice the munged data values:

#!/usr/bin/env perl

use strict;
use warnings;

my @all_matches = (
    [ "chpt10_2", "sent. 2",  "alice", "nsubj", "animAls", "protect" ],
    [ "chpt12_1", "sent. 54", "bob",   "nsubj", "celLs",   "protect" ],
    [ "chpt25_4", "sent. 47", "carol", "nsubj", "plAnts",  "protect" ],
    [ "chpt34_1", "sent. 1",  "dave",  "nsubj", "cElls",   "protect" ],
    [ "chpt35_1", "sent. 2",  "eli",   "nsubj", "cells",   "protect" ],
    [ "chpt38_1", "sent. 1",  "fred",  "nsubj", "Animals", "protect" ],
    [ "chpt54_1", "sent. 1",  "greg",  "nsubj", "uticle",  "protect" ],
    );

my %counts;
foreach my $rowref (@all_matches)
{
    push @$rowref, lc($rowref->[4]);
    $counts{$rowref->[6]}++;
}

@all_matches = sort { $counts{$b->[6]} <=> $counts{$a->[6]} || $a->[6] cmp $b->[6]
                    } @all_matches;

my $i = 0;
foreach my $rowref (@all_matches)
{
    $i++;
    print "$i";
    printf " %-9s", $_ foreach (@$rowref);
    print "\n";
}

Output:

1 chpt12_1  sent. 54  bob       nsubj     celLs     protect   cells    
2 chpt34_1  sent. 1   dave      nsubj     cElls     protect   cells    
3 chpt35_1  sent. 2   eli       nsubj     cells     protect   cells    
4 chpt10_2  sent. 2   alice     nsubj     animAls   protect   animals  
5 chpt38_1  sent. 1   fred      nsubj     Animals   protect   animals  
6 chpt25_4  sent. 47  carol     nsubj     plAnts    protect   plants   
7 chpt54_1  sent. 1   greg      nsubj     uticle    protect   uticle   


Try this one:

my @all_matches = (["chpt10_2", "sent. 2", "alice", "nsubj", "animals", "protect"],
        ["chpt12_1", "sent. 54", "bob", "nsubj", "cells", "protect"],
        ["chpt25_4", "sent. 47", "carol", "nsubj", "plants", "protect"],
        ["chpt34_1", "sent. 1", "dave", "nsubj", "cells", "protect"],
        ["chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect"],
        ["chpt38_1", "sent. 1", "fred", "nsubj", "animals", "protect"],
        ["chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect"]
        );

my %wordcount;

foreach my $row (@all_matches) {
        $wordcount{$row->[4]}++;
}

my @sorted = sort { $wordcount{$b->[4]} <=> $wordcount{$a->[4]}  } @all_matches;
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜