How to group series of consecutive numbers in Perl
I have a data input that looks like this:
seq 75 T G -
seq 3185 A R +
seq 3382 A R +
seq 4923 C - + *
seq 4924 C - + *
seq 4925 T - + *
seq 5252 A W +
seq 7400 T C -
seq 16710 C - - #
seq 18248 T C -
seq 18962 C - + *
seq 18963 A - + *
seq 18964 T - + *
seq 18965 A - + *
seq 19566 A M +
The input above is already sorted by the 2nd column.
What I want to do is:
- Only process lines where the 4th column is "-".
- If these lines have consecutive positions (2nd column), group them.
- Represent each group as one new line, with the lowest position as the new position and the concatenation of the grouped letters (3rd column) as the new string.
Hence we expect to get this output:
seq 75 T G -
seq 3185 A R +
seq 3382 A R +
seq 4923 CCT - + **
seq 5252 A W +
seq 7400 T C -
seq 16710 C - - #
seq 18248 T C -
seq 18962 CATA - + **
seq 19566 A M +
The lines marked ** are the new lines/strings formed from the lines marked * in the first list (the input).
The line marked # is kept as it is because there is no consecutive position after it.
I am stuck with the following logic, not sure how to proceed:
while ( <> ) {
    chomp;
    my @els = split(/\s+/,$_);
    # Process indel
    my @temp = ();
    if ( $els[3] eq "-" ) {
        push @temp, $_;
    }
    # How can I group them appropriately.
    print Dumper \@temp ;
    # And print accordingly to input ordering
}
This is a variation on control-break reporting. This code seems to do the job:
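use strict;
use warnings;

my($prev) = -100;    # position of the previous "-" line; -100 means none
my($grp0) = $prev;   # first position of the current group; negative means no group
my($col2, $col4);    # accumulated letters and last column of the current group

# Print one output line; does nothing when no group is pending.
sub print_group
{
    my($grp0, $col2, $col3, $col4) = @_;
    printf "seq %-5d %-4s %s %s\n", $grp0, $col2, $col3, $col4
        if ($grp0 > 0);
}

while (<>)
{
    chomp;
    my @els = split(/\s+/,$_);
    if ($els[3] ne "-")
    {
        # Ordinary line: flush any pending group, then print the line itself.
        print_group($grp0, $col2, "-", $col4);
        print_group($els[1], $els[2], $els[3], $els[4]);
        $prev = -100;
        $grp0 = -100;
        $col2 = "";
        $col4 = "";
    }
    elsif ($els[1] == $prev + 1)
    {
        # Consecutive position: extend the current group.
        $grp0 = $prev if $grp0 < 0;
        $prev = $els[1];
        $col2 .= $els[2];
        $col4 = $els[4];
    }
    else
    {
        # Non-consecutive "-" line: flush the pending group and start a new one.
        print_group($grp0, $col2, "-", $col4);
        $prev = $els[1];
        $grp0 = $els[1];
        $col2 = $els[2];
        $col4 = $els[4];
    }
}
# Flush a group that runs to the end of the input.
print_group($grp0, $col2, "-", $col4);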
Example output:
seq 75    T    G -
seq 3185  A    R +
seq 3382  A    R +
seq 4923  CCT  - +
seq 5252  A    W +
seq 7400  T    C -
seq 16710 C    - -
seq 18248 T    C -
seq 18962 CATA - +
seq 19566 A    M +
This output is more uniform than in the previous edition of this answer, although the basic logic is very much the same as before. Every line is now printed by the same function, so the formatting is consistent throughout. Note that the trailing * and # markers from the input are not printed.
It can be fiendishly difficult to get the conditions correct - it took several (far too many) iterations to get this code to produce the expected output.
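One way to cut down on the number of conditions is to buffer the rows of the current run and flush the buffer whenever the run ends. The following is only a minimal sketch of that idea, not a replacement for the code above. The names group_deletions, @run and $flush are made up for illustration. It assumes the input is sorted by position, that the 5th column is the same for every row in a run (it is taken from the first row), and, like the code above, it drops the trailing * and # markers from grouped rows.

```perl
use strict;
use warnings;

# Collapse runs of consecutive-position "-" rows into one row each.
# Takes a list of input lines and returns a list of output lines.
sub group_deletions {
    my @lines = @_;
    my (@out, @run);    # @run buffers the split rows of the current run

    # Emit the buffered run as a single row, then empty the buffer.
    my $flush = sub {
        return unless @run;
        my $letters = join '', map { $_->[2] } @run;
        push @out, join ' ', $run[0][0], $run[0][1], $letters, '-', $run[0][4];
        @run = ();
    };

    for my $line (@lines) {
        my @els = split ' ', $line;    # also discards the trailing newline
        next unless @els;              # skip blank lines
        if (@els >= 5 && $els[3] eq '-') {
            # A gap in position ends the current run before this row joins.
            $flush->() if @run && $els[1] != $run[-1][1] + 1;
            push @run, \@els;
        }
        else {
            # Any other row ends the run and passes through unchanged.
            $flush->();
            push @out, join ' ', @els;
        }
    }
    $flush->();    # do not lose a run that reaches the end of the input
    return @out;
}
```

It could be used as print "$_\n" for group_deletions(<>); which reads the whole input before printing. The only decision left in the loop is whether a row continues the current run, so the end-of-input case becomes one more call to the same flush routine.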