How to get an evenly distributed sample from Perl array values?

I have an array containing many values between 0 and 360 (like degrees in a circle), but unevenly distributed:

1,45,46,47,48,49,50,51,52,53,54,55,100,120,140,188, 210, 280, 355

Now I need to reduce those values to e.g. 4 only, but as evenly as possible distributed values.

How to do that?

Thanks, Jan


Put the numbers on a circle, like a clock. Now construct a logical cross, say at 12, 3, 6, and 9 o’clock. Put the 12 at the first number. Now find what numbers would be nearest to 3, 6, and 9 o’clock, and record the sum of those three numbers’ distances next to the first number.

Iterate by rotating the top of your cross — the 12 o’clock point — clockwise until it exactly lines up with the next number. Again measure how far the nearest numbers are to each of your three other crosspoints, and record that score next to this current 12 o’clock number.

Repeat until your 12 o’clock point has rotated all the way to the original 3 o’clock position, at which point you’re done. Whichever number has the lowest sum assigned to it determines the winning configuration.

This solution generalizes to any range of values R and any number N of final points you wish to reduce the set to. Adjacent points on the “cross” are R/N apart, and you need only rotate until the top of your cross reaches where the next arm was in the original position. So if you wanted 6 points, you would have a 6-pointed cross with arms 60 degrees apart instead of a 4-pointed cross with arms 90 degrees apart. If your range is different, you still do the same sort of operation. That way you don’t need a physical clock and cross to implement this algorithm: it works for any R and N.
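
For anyone who wants to see the rotating cross in Perl, here is a minimal sketch, not a definitive implementation. The sub names circ_dist, nearest and even_sample are made up for illustration. As a simplification it tries every value as the 12 o’clock arm instead of stopping after one R/N turn, which repeats some configurations but keeps the loop simple, and it does nothing to stop the same value from being picked for two arms.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Shortest distance between two positions on a circle of circumference $range.
sub circ_dist {
    my ($x, $y, $range) = @_;
    my $d = abs($x - $y);
    $d -= $range * int($d / $range);    # fold into [0, $range)
    return min($d, $range - $d);
}

# Value in @$values nearest to $target on the circle (first one wins on ties),
# returned together with its distance.
sub nearest {
    my ($target, $values, $range) = @_;
    my ($best, $best_d);
    for my $v (@$values) {
        my $d = circ_dist($v, $target, $range);
        ($best, $best_d) = ($v, $d) if !defined($best_d) || $d < $best_d;
    }
    return ($best, $best_d);
}

# Put the first arm of an N-armed cross on each value in turn, score the other
# arms by the distance to their nearest value, and keep the lowest-scoring set.
sub even_sample {
    my ($values, $n, $range) = @_;
    my $step = $range / $n;             # spacing between arms, R/N
    my ($best_score, @best_pick);
    for my $start (@$values) {
        my ($score, @pick) = (0, $start);
        for my $k (1 .. $n - 1) {
            my ($v, $d) = nearest($start + $k * $step, $values, $range);
            $score += $d;
            push @pick, $v;
        }
        ($best_score, @best_pick) = ($score, @pick)
            if !defined($best_score) || $score < $best_score;
    }
    return @best_pick;
}

# The array from the question, reduced to 4 values on a 360-degree circle.
my @degrees = (1, 45 .. 55, 100, 120, 140, 188, 210, 280, 355);
print join(', ', even_sample(\@degrees, 4, 360)), "\n";
```

The values come back in arm order, starting with the one chosen as 12 o’clock. Distances are measured around the circle, so 355 and 1 count as 6 apart rather than 354.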

I feel bad about this answer from a Perl perspective, as I’ve not managed to include any dollar signs in the solution. :)


Use a clustering algorithm to divide your data into partitions, then grab a random value from each cluster. In the example below, $datafile looks like this:

1   1
45  45
46  46
...
210 210
280 280
355 355

First column is a tag, second column is data. Running the following with $K = 4:

use strict; use warnings;
use Algorithm::KMeans;

my $datafile = $ARGV[0] or die;
my $K        = $ARGV[1] || 0;   # falls back to 0 when no cluster count is given
my $mask     = 'N1';

my $clusterer = Algorithm::KMeans->new(
    datafile => $datafile,
    mask     => $mask,
    K        => $K,
    terminal_output => 0,
);

$clusterer->read_data_from_file();

my ($clusters, $cluster_centers) = $clusterer->kmeans();

my %clusters;

while (@$clusters) {

    my $cluster = shift @$clusters;
    my $center  = shift @$cluster_centers;

    $clusters{"@$center"} = $cluster->[ int rand @$cluster ];   # random member of this cluster
}

use YAML; print Dump \%clusters;

returns this:

120: 120
199: 188
317.5: 355
45.9166666666667: 46

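First column is the center of the cluster, second is the selected value from that cluster. K-means tries to place the centers so that the total squared distance from each value to its nearest center is minimized (an iterative procedure closely related to Expectation Maximization), which tends to spread the centers across the data.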
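
The two-column data file described at the start of this answer can be produced from the array in the question with a few lines of Perl. This is only a sketch: the sub name datafile_lines is made up for illustration, and the space-separated "tag value" layout simply mirrors the listing shown above, so check it against what your installed version of Algorithm::KMeans expects.

```perl
use strict;
use warnings;

# Build one "tag value" line per number; each value doubles as its own tag,
# so the tag identifies the original number.
sub datafile_lines {
    my @values = @_;
    return map { "$_ $_\n" } @values;
}

# The array from the question.
my @degrees = (1, 45 .. 55, 100, 120, 140, 188, 210, 280, 355);

# Print to STDOUT; redirect the output into the file passed as $ARGV[0] above.
print datafile_lines(@degrees);
```

Redirect the output into a file of your choosing and pass that file name as the first argument to the clustering script.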
