Interleaving sparse sorted arrays
I've got a set of lists of events. The events always happen in a given order, but not every event always happens. Here's an example input:
[[ do, re, fa, ti ],
[ do, re, mi ],
[ do, la, ti, za ],
[ mi, fa ],
[ re, so, za ]]
The input values don't have any inherent order. They're actually messages like "creating symlinks" and "reindexing search". They're sorted in the individual list, but there's no way to look at only 'fa' in the first list and 'mi' in the second and determine which comes before the other.
I'd like to be able to take that input and generate a sorted list of all events:
[ do, re, mi, fa, so, la, ti, za ]
or better yet, some i开发者_运维百科nformation about each event, like a count:
[ [do, 3], [re, 3], [mi, 2],
[fa, 2], [so, 1], [la, 1],
[ti, 1], [za, 2] ]
Is there a name for what I'm doing? Are there accepted algorithms? I'm writing this in Perl, if that matters, but pseudocode will do.
I know that given my example input, I probably can't be guaranteed of the "right" order. But my real input has tons more datapoints, and I feel confident that with some cleverness it'll be 95% right (which is really all I need). I just don't want to re-invent the wheel if I don't have to.
You can use tsort
to infer a reasonable—although not necessarily unique—sort order (known as a topological order) from the ordering you've observed. You may be interested in reading tsort
's original use, which is similar in structure to your problem.
Note that tsort
requires an acyclic graph. In terms of your example, this means you couldn't see do followed by re in one sequence and re followed by do in another.
#! /usr/bin/perl
use warnings;
use strict;
use IPC::Open2;
sub tsort {
my($events) = @_;
my $pid = open2 my $out, my $in, "tsort";
foreach my $group (@$events) {
foreach my $i (0 .. $#$group - 1) {
print $in map "@$group[$i,$_]\n", $i+1 .. $#$group;
}
}
close $in or warn "$0: close: $!";
chomp(my @order = <$out>);
my %order = map +(shift @order => $_), 0 .. $#order;
wantarray ? %order : \%order;
}
Because you described the data as sparse, the code above provides tsort
with as much information as possible about the events' adjacency matrix.
Having that information, computing a histogram and sorting its components is straightforward:
my $events = [ ... ];
my %order = tsort $events;
my %seen;
do { ++$seen{$_} for @$_ } for @$events;
my @counts;
foreach my $event (sort { $order{$a} <=> $order{$b} } keys %seen) {
push @counts => [ $event, $seen{$event} ];
print "[ $counts[-1][0], $counts[-1][1] ]\n";
}
For the input in your question you provided, the output is
[ do, 3 ] [ la, 1 ] [ re, 3 ] [ so, 1 ] [ mi, 2 ] [ fa, 2 ] [ ti, 2 ] [ za, 2 ]
This looks funny because we know the order of solfège, but re and la are incomparable in the partial order defined by $events
: we know only that they must both come after do.
Theoretically speaking, let me suggest the following algorithm:
- Build a directed graph.
- For each input [ X, Y, Z ], create the edges X->Y and Y->Z if they're not already there.
- Perform a topological sorting of the graph.
- Voila!
PS
This is only assuming that all events occur in a specific order (always!). If that's not the case, the problem becomes NP-Complete.
PPS
And just so that you have something useful: Sort::Topological (don't know if it actually works but it seems right)
If you're not into writing to much code, you could use the unix command-line utility tsort
:
$ tsort -
do re
re fa
fa ti
do re
re mi
do la
la ti
ti za
mi fa
re so
so za
Which is a list of all pairs in your sample input. This produces as output:
do
la
re
so
mi
fa
ti
za
which is basically what you want.
Use a hash to aggregate.
my $notes= [[qw(do re fa ti)],
[qw(do re mi)],
[qw(do la ti za)],
[qw(mi fa)],
[qw(re so za)]];
my %out;
foreach my $list (@$notes)
{
$out{$_}++ foreach @$list;
}
print "$_: $out{$_}\n" foreach sort keys %out;
Yields
do: 3
fa: 2
la: 1
mi: 2
re: 3
so: 1
ti: 2
za: 2
The %out hash is easily converted into a list if that is what you want.
my @newout;
push @newout,[$_,$out{$_}] foreach sort keys %out;
perl -de 0
DB<1> @a = ( ['a','b','c'], ['c','f'], ['h'] )
DB<2> map { @m{@{$_}} = @$_ } @a
DB<3> p keys %m
chabf
Quickiest shortcut I can think of. Either way, you have to iterate through things at least once...
This is a perfect candidate for a Merge Sort. Go to the wikipedia page here for a pretty good representation of the algorithm http://en.wikipedia.org/wiki/Merge_sort
What you have described is actually a subset/small tweak of the merge sort. Instead of starting with an unsorted array, you have a set of sorted arrays that you want to merge together. Just call the "merge" function as described in the wikipedia page on pairs of your arrays and the results of the merge function until you have a single array (which will be sorted).
To tweak the output to the way you want, you'll need to define a comparison function that can return if one event is less than, equal to, or greater than a different event. Then, when your merge function finds two events that are equal, you can collapse them into a single event and keep a count for that event.
Roughly, the name I would give it is "hashing". You are putting things into name value pairs. If you want to keep some semblance of order, you have to supplement the hash with an array that keeps order. That order is "encounter order" for me.
use strict;
use warnings;
my $all
= [[ 'do', 're', 'fa', 'ti' ],
[ 'do', 're', 'mi' ],
[ 'do', 'la', 'ti', 'za' ],
[ 'mi', 'fa' ],
[ 're', 'so', 'za' ]
];
my ( @order, %counts );
foreach my $list ( @$all ) {
foreach my $item ( @$list ) {
my $ref = \$counts{$item}; # autovivs to an *assignable* scalar.
push @order, $item unless $$ref;
$$ref++;
}
}
foreach my $key ( @order ) {
print "$key: $counts{$key}\n";
}
# do: 3
# re: 3
# fa: 2
# ti: 2
# mi: 2
# la: 1
# za: 2
# so: 1
There are other answers like this one, but mine contains this neat autovivification trick.
I'm not really sure what this would be called either, but I figured out a way to find the order given the array of arrays as an input. Essentially the pseudo-code is:
10 Find earliest item in all arrays
20 Push that onto a list
30 Remove that item from all arrays
40 Goto 10 if there are any items left
Here's a working prototype:
#!/usr/bin/perl
use strict;
sub InList {
my ($x, @list) = @_;
for (@list) {
return 1 if $x eq $_;
}
return 0;
}
sub Earliest {
my @lists = @_;
my $earliest;
for (@lists) {
if (@$_) {
if (!$earliest
|| ($_->[0] ne $earliest && InList($earliest, @$_))) {
$earliest = $_->[0];
}
}
}
return $earliest;
}
sub Remove {
my ($x, @lists) = @_;
for (@lists) {
my $n = 0;
while ($n < @$_) {
if ($_->[$n] eq $x) {
splice(@$_,$n,1);
}
else {
$n++
}
}
}
}
my $list = [
[ 'do', 're', 'fa', 'ti' ],
[ 'do', 're', 'mi' ],
[ 'do', 'la', 'ti', 'za' ],
[ 'mi', 'fa' ],
[ 're', 'so', 'za' ]
];
my @items;
while (my $earliest = Earliest(@$list)) {
push @items, $earliest;
Remove($earliest, @$list);
}
print join(',', @items);
Output:
do,re,mi,fa,la,ti,so,za
Just realized your question said their is no predetermined order, so this may not be relevent.
Perl code:
$list = [
['do', 're', 'fa', 'ti' ],
['do', 're', 'mi' ],
['do', 'la', 'ti', 'za' ],
['mi', 'fa' ],
['re', 'so', 'za' ]
];
%sid = map{($_,$n++)}qw/do re mi fa so la ti za/;
map{map{$k{$_}++}@$_}@$list;
push @$result,[$_,$k{$_}] for sort{$sid{$a}<=>$sid{$b}}keys%k;
print "[@$_]\n" for(@$result);
output:
[do 3]
[re 3]
[mi 2]
[fa 2]
[so 1]
[la 1]
[ti 2]
[za 2]
Solution:
This solves the original question before it was modified by the asker.
#!/usr/local/bin/perl -w
use strict;
main();
sub main{
# Changed your 3-dimensional array to a 2-dimensional array
my @old = (
[ 'do', 're', 'fa', 'ti' ],
[ 'do', 're', 'mi' ],
[ 'do', 'la', 'ti', 'za' ],
[ 'mi', 'fa' ],
[ 're', 'so', 'za' ]
);
my %new;
foreach my $row (0.. $#old ){ # loop through each record (row)
foreach my $col (0..$#{$old[$row]} ){ # loop through each element (col)
$new{ ${$old[$row]}[$col] }{count}++;
push @{ $new{${$old[$row]}[$col]}{position} } , [$row,$col];
}
}
foreach my $key (sort keys %new){
print "$key : $new{$key} " , "\n"; # notice each value is a hash that we use for properties
}
}
How to Retrieve Info:
local $" = ', '; # pretty print ($") of array in quotes
print $new{za}{count} , "\n"; # 2 - how many there were
print "@{$new{za}{position}[1]} \n"; # 4,2 - position of the second occurrence
# remember it starts at 0
Basically, we create a unique list of elements in the hash. For each of those elements we have a "property" hash, that contains a scalar count
and an array for the position
. The number of elements in the array should vary, based on how many occurrences of the element were in the original.
The scalar property isn't really necessary since you could always take the scalar of the position
array to retrieve the same number. Note: if you ever add/remove elements from the array count
and position
will not be correlate in their meaning.
- example:
print scalar @{$new{za}{position}};
will give you the same asprint $new{za}{count};
精彩评论