Clustering By Interval Via Hash of Array in Perl

I have data that looks like this:

#Status value
TP       5.000
TP       3.000
TP       3.000
TN       10.000
TP       2.000
TP       9.000
TN       1.000
TP       9.000
TN       1.000

I want to cluster the Status entries by the interval their value falls into. Let the intervals be 1-3, 4-6, 7-9, 10-12, etc. (i.e. a bin size of 3).

I hope to get a hash of arrays like this:

my %hoa = (
'1-3' => ['TP','TP','TP','TN','TN'],
'4-6' => ['TP'],
'7-9' => ['TP','TP'],
'10-12' => ['TN']);

What's the way to achieve that?

Update: Corrected the HoA for 7-9, thanks to ysth.


Abstracting away the code that determines the interval:

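sub interval {
    my ($val) = @_;
    # 1-based bin index: 1..3 => 1, 4..6 => 2, 7..9 => 3, ...
    my $i = int( ( $val + 2 ) / 3 );
    # Label the bin by its bounds, e.g. '4-6'
    my $interval = sprintf( '%d-%d', $i * 3 - 2, $i * 3 );
    return $interval;
}

my %hoa;
while ( my $line = <> ) {
    next if $line =~ /^#/;                        # skip the header line
    my ($status, $value) = split ' ', $line;
    push @{ $hoa{ interval($value) } }, $status;  # autovivifies the array
}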

use Data::Dumper;
print Dumper \%hoa;

(which gets two TPs for 7-9, not one as you show).
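
If the bin size might change later, the same arithmetic can take the width as a parameter. This is only a minimal sketch: the names interval_for and cluster_by_interval are made up for illustration, it assumes positive values, and the sample lines are hard-coded so that it runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: label of the bin of width $width that contains $val.
# Assumes positive values; bins are 1..$width, $width+1..2*$width, and so on.
sub interval_for {
    my ($val, $width) = @_;
    my $i = int( ( $val + $width - 1 ) / $width );    # 1-based bin index
    return sprintf '%d-%d', ( $i - 1 ) * $width + 1, $i * $width;
}

# Hypothetical helper: group the status of each line under its bin label.
sub cluster_by_interval {
    my ($width, @lines) = @_;
    my %hoa;
    for my $line (@lines) {
        next if $line =~ /^#/ || $line !~ /\S/;       # skip header and blank lines
        my ($status, $value) = split ' ', $line;
        push @{ $hoa{ interval_for($value, $width) } }, $status;
    }
    return \%hoa;
}

# The sample data from the question, inlined for the example.
my @sample = (
    '#Status value',
    'TP 5.000', 'TP 3.000', 'TP 3.000', 'TN 10.000', 'TP 2.000',
    'TP 9.000', 'TN 1.000', 'TP 9.000', 'TN 1.000',
);
my $hoa = cluster_by_interval( 3, @sample );

# Print the bins in numeric order of their lower bound.
for my $key ( sort { ( $a =~ /^(\d+)/ )[0] <=> ( $b =~ /^(\d+)/ )[0] } keys %$hoa ) {
    print "$key => @{ $hoa->{$key} }\n";
}
```

With a width of 3 this reproduces the grouping above, including the two TPs for 7-9. Sorting on the lower bound keeps '10-12' after '7-9', which a plain string sort of the keys would not.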


ysth's answer was the first thing that occurred to me as well, and I think he has the right approach.

I'd just like to leave a suggestion: you could use a clustering algorithm to do this for you in a future-proof kind of way (say, when your data becomes multidimensional). K-means, for example, would work fine, even for 1D data such as yours.

For example:

use strict; use warnings;
use Algorithm::KMeans;

my $datafile = $ARGV[0] or die;
my $K        = $ARGV[1] || 0;   # 0 lets the module choose K itself
my $mask     = 'N1';            # first column is the record name, second the value

my $clusterer = Algorithm::KMeans->new(
    datafile => $datafile,
    mask     => $mask,
    K        => $K,
    terminal_output => 0,
);

$clusterer->read_data_from_file();

my ($clusters, $cluster_centers) = $clusterer->kmeans();

my %clusters;

while (@$clusters) {

    my $cluster = shift @$clusters;
    my $center  = shift @$cluster_centers;

    $clusters{"@$center"} = $cluster;
}

use YAML; print Dump \%clusters;
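
To see what k-means does with 1D values without installing anything, here is a minimal pure-Perl sketch of Lloyd's algorithm. It is not how Algorithm::KMeans works internally. The name kmeans_1d, the evenly spaced starting centers, and the cap of 100 iterations are all assumptions made for illustration, and it expects K to be at least 1.

```perl
use strict;
use warnings;
use List::Util qw(min max sum);

# Hypothetical helper: plain Lloyd's algorithm on a list of numbers.
# Returns one array ref per non-empty cluster. Assumes $k >= 1.
sub kmeans_1d {
    my ($k, @values) = @_;
    my ($lo, $hi) = ( min(@values), max(@values) );

    # Assumption: start with centers spread evenly over the data range.
    my @centers = map { $lo + ( $_ + 0.5 ) * ( $hi - $lo ) / $k } 0 .. $k - 1;

    my @clusters;
    for ( 1 .. 100 ) {    # iteration cap, in case convergence is slow
        # Assignment step: each value joins its nearest center.
        @clusters = map { [] } @centers;
        for my $v (@values) {
            my ($best) = sort { abs( $v - $centers[$a] ) <=> abs( $v - $centers[$b] ) }
                         0 .. $#centers;
            push @{ $clusters[$best] }, $v;
        }

        # Update step: move each center to the mean of its members;
        # a center with no members stays where it is.
        my @new = @centers;
        for my $i ( 0 .. $#centers ) {
            my @members = @{ $clusters[$i] };
            $new[$i] = sum(@members) / @members if @members;
        }

        last if "@new" eq "@centers";    # stop once no center moved
        @centers = @new;
    }
    return grep { @$_ } @clusters;
}

# The values from the question, with K fixed at 3.
my @values = ( 5, 3, 3, 10, 2, 9, 1, 9, 1 );
print "[@$_]\n" for kmeans_1d( 3, @values );
```

On the sample values with K = 3 this puts 10 in the same cluster as the two 9s, so the clusters follow the data rather than the fixed 1-3, 4-6, ... bins. Whether that is what you want depends on whether the interval boundaries themselves matter.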