Better way to sort by frequency in 2D array?

2023-03-15 15:13 问答作者：

I have this method of sorting which is basically just basic thought processes, not using Perl power, and once in a while it doesn't act how I want it (misses some frequency counting). I was wondering if there was a better way to sort this.

Objective Sort the array based on frequency of matches found.

Sample array of arrays

##ADDED 1 to END of EACH ROW, just because my sort forced me too!!!
my @all_matches = (["chpt10_2", "sent. 2", "alice", "nsubj", "animals", "protect"],
               ["chpt12_1", "sent. 54", "bob", "nsubj", "cells", "protect"],
               ["chpt25_4", "sent. 47", "carol", "nsubj", "plants", "protect"],
               ["chpt34_1", "sent. 1", "dave", "nsubj", "cells", "protect"],
               ["chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect"],
               ["chpt38_1", "sent. 1", "fred", "nsubj", "animals", "protect"],
               ["chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect"]
              );

Current sort

@all_matches = sort {lc($a->[4]) cmp lc($b->[4])} @all_matches;

my ($last_word, $current_word, $word_count);

for my $j (0 .. $#all_matches) {

    $current_word = $all_matches[$j][4];

    if (lc($last_word) eq lc($current_word)) {
        $word_count++;
        }
    else {
        if ($j != 0)
        {
            for (my $k = 1; $k <= $word_count; $k++)
            {
               $all_matches[($j-$k)][6] = $word_count; 
            }
        }
        $last_word = $current_word;
        $word_count = 1;
        }
}
@all_matches = sort {$b->[6] <=> $a->[6] || lc($a->[4]) cmp lc($b->[4])} @all_matches;

Problem The 6th column is set to 1 when all_matches is passed in!!! The reason this was done was because sometimes, the count ($match->[6]) was blank.

Bonus? Match frequency of times the last two columns appear together (right now I'm pretty sure 开发者_开发技巧it just checks 2nd last column). In this test case, the final column is all the same, in the actual case, there are different suffixes on the end (ie. protect, protects, protective etc..)

THANKS a lot for your time. I've tried using a hash, and thought it worked, however it neglected some things.

Here was my hash attempt. Couldn't tell you yet why this didn't work:

my %freq;
foreach ( map{$_->[4]}@results) #feeds in list of animals, cells, uticle, etc.
{
   $freq{lc $_}++;
}


@results = sort {$freq{lc $b->[4]} <=> $freq{lc $a->[4]} #freq order
                                   or
                         $a->[0]  cmp $b->[0]            #text col 0      
                } @results;

Why not create a hash of the keys with a count of the occurrences, and use that:

my %counts;
foreach my $rowref (@all_matches)
{
     $counts{lc($rowref->[4])}++;
}

@all_matches = sort { $counts{lc($b->[4])} <=> $counts{lc($a->[4])} ||
                      lc($a->[4]) cmp lc($b->[4])
                    } @all_matches;

Tested...

#!/usr/bin/env perl

use strict;
use warnings;

my @all_matches = (
    ["chpt10_2", "sent. 2", "alice", "nsubj", "animals", "protect"],
    ["chpt12_1", "sent. 54", "bob", "nsubj", "cells", "protect"],
    ["chpt25_4", "sent. 47", "carol", "nsubj", "plants", "protect"],
    ["chpt34_1", "sent. 1", "dave", "nsubj", "cells", "protect"],
    ["chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect"],
    ["chpt38_1", "sent. 1", "fred", "nsubj", "animals", "protect"],
    ["chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect"]
    );

my %counts;
foreach my $rowref (@all_matches)
{
    $counts{lc($rowref->[4])}++;
}

@all_matches = sort { $counts{lc($b->[4])} <=> $counts{lc($a->[4])} ||
                      lc($a->[4]) cmp lc($b->[4])
                    } @all_matches;

my $i = 0;
foreach my $rowref (@all_matches)
{
    $i++;
    print "$i";
    print " $_" foreach (@$rowref);
    print "\n";
}

Output:

1 chpt12_1 sent. 54 bob nsubj cells protect
2 chpt34_1 sent. 1 dave nsubj cells protect
3 chpt35_1 sent. 2 eli nsubj cells protect
4 chpt10_2 sent. 2 alice nsubj animals protect
5 chpt38_1 sent. 1 fred nsubj animals protect
6 chpt25_4 sent. 47 carol nsubj plants protect
7 chpt54_1 sent. 1 greg nsubj uticle protect

As noted in a comment, given the data shown, the lc operations are not needed - and removing them would improve performance, as would adding a case-converted key to each array.

And with lc used once per row - notice the munged data values:

#!/usr/bin/env perl

use strict;
use warnings;

my @all_matches = (
    [ "chpt10_2", "sent. 2",  "alice", "nsubj", "animAls", "protect" ],
    [ "chpt12_1", "sent. 54", "bob",   "nsubj", "celLs",   "protect" ],
    [ "chpt25_4", "sent. 47", "carol", "nsubj", "plAnts",  "protect" ],
    [ "chpt34_1", "sent. 1",  "dave",  "nsubj", "cElls",   "protect" ],
    [ "chpt35_1", "sent. 2",  "eli",   "nsubj", "cells",   "protect" ],
    [ "chpt38_1", "sent. 1",  "fred",  "nsubj", "Animals", "protect" ],
    [ "chpt54_1", "sent. 1",  "greg",  "nsubj", "uticle",  "protect" ],
    );

my %counts;
foreach my $rowref (@all_matches)
{
    push @$rowref, lc($rowref->[4]);
    $counts{$rowref->[6]}++;
}

@all_matches = sort { $counts{$b->[6]} <=> $counts{$a->[6]} || $a->[6] cmp $b->[6]
                    } @all_matches;

my $i = 0;
foreach my $rowref (@all_matches)
{
    $i++;
    print "$i";
    printf " %-9s", $_ foreach (@$rowref);
    print "\n";
}

Output:

1 chpt12_1  sent. 54  bob       nsubj     celLs     protect   cells    
2 chpt34_1  sent. 1   dave      nsubj     cElls     protect   cells    
3 chpt35_1  sent. 2   eli       nsubj     cells     protect   cells    
4 chpt10_2  sent. 2   alice     nsubj     animAls   protect   animals  
5 chpt38_1  sent. 1   fred      nsubj     Animals   protect   animals  
6 chpt25_4  sent. 47  carol     nsubj     plAnts    protect   plants   
7 chpt54_1  sent. 1   greg      nsubj     uticle    protect   uticle

Try this one:

my @all_matches = (["chpt10_2", "sent. 2", "alice", "nsubj", "animals", "protect"],
        ["chpt12_1", "sent. 54", "bob", "nsubj", "cells", "protect"],
        ["chpt25_4", "sent. 47", "carol", "nsubj", "plants", "protect"],
        ["chpt34_1", "sent. 1", "dave", "nsubj", "cells", "protect"],
        ["chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect"],
        ["chpt38_1", "sent. 1", "fred", "nsubj", "animals", "protect"],
        ["chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect"]
        );

my %wordcount;

foreach my $row (@all_matches) {
        $wordcount{$row->[4]}++;
}

my @sorted = sort { $wordcount{$b->[4]} <=> $wordcount{$a->[4]}  } @all_matches;

继续阅读：perl sorting

Better way to sort by frequency in 2D array?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？