
How to group a series of consecutive numbers in Perl

I have a data input that looks like this:

seq   75      T   G   - 
seq   3185    A   R   +
seq   3382    A   R   +
seq   4923    C   -   + *
seq   4924    C   -   + *
seq   4925    T   -   + *
seq   5252    A   W   +
seq   7400    T   C   -
seq   16710   C   -   - #
seq   18248   T   C   -
seq   18962   C   -   + *
seq   18963   A   -   + *
seq   18964   T   -   + *
seq   18965   A   -   + *
seq   19566   A   M   +

The input above is already sorted by the 2nd column.

What I want to do is to:

  1. Only process lines where the 4th column is "-".
  2. If these lines have consecutive positions (2nd column), group them.
  3. Represent each group as one new line, with the lowest position as the new position and the concatenation of the grouped letters (3rd column) as the new string.

Hence we expect to get this output:

seq   75      T   G   -   
seq   3185    A   R   +
seq   3382    A   R   +
seq   4923    CCT   -   + **
seq   5252    A   W   +
seq   7400    T   C   -
seq   16710   C   -   - #
seq   18248   T   C   -
seq   18962   CATA   -   + **
seq   19566   A   M   +

** marks the new lines formed from the * lines in the first list (input).
The # line is kept as it is because no consecutive position follows it.

I am stuck with the following logic and am not sure how to proceed:

use strict;
use warnings;
use Data::Dumper;

while ( <> ) {
    chomp;

    my @els = split(/\s+/,$_);

    # Process indel
    my @temp = ();
    if ( $els[3] eq "-"  ) {
        push @temp, $_;
    }

    # How can I group them appropriately?
    print Dumper \@temp ;

    # And print according to the input ordering

}


This is a variation on control-break reporting. This code seems to do the job:

use strict;
use warnings;

my($prev) = -100;
my($grp0) = $prev;
my($col2, $col4);

sub print_group
{
    my($grp0, $col2, $col3, $col4) = @_;
    printf "seq   %-5d  %-4s  %s  %s\n", $grp0, $col2, $col3, $col4
        if ($grp0 > 0);
}

while (<>)
{
    chomp;
    my @els = split(/\s+/,$_);
    if ($els[3] ne "-")
    {
        print_group($grp0,   $col2,   "-",     $col4);
        print_group($els[1], $els[2], $els[3], $els[4]);
        $prev = -100;
        $grp0 = -100;
        $col2 = "";
        $col4 = "";
    }
    elsif ($els[1] == $prev + 1)
    {
        $grp0  = $prev if $grp0 < 0;
        $prev  = $els[1];
        $col2 .= $els[2];
        $col4  = $els[4];
    }
    else
    {
        print_group($grp0, $col2, "-", $col4);
        $prev = $els[1];
        $grp0 = $els[1];
        $col2 = $els[2];
        $col4 = $els[4];
    }
}

# Flush the last group; print_group takes four arguments, so pass "-" for column 4.
print_group($grp0, $col2, "-", $col4);

Example output:

seq   75     T     G  -
seq   3185   A     R  +
seq   3382   A     R  +
seq   4923   CCT   -  +
seq   5252   A     W  +
seq   7400   T     C  -
seq   16710  C     -  -
seq   18248  T     C  -
seq   18962  CATA  -  +
seq   19566  A     M  +

This output is more uniform than in the previous version of this answer, but the basic logic is much the same as before. Every line is printed by the same function, so the formatting is consistent. Note that the trailing * and # markers from the input are not reproduced.

It can be fiendishly difficult to get the conditions correct - it took several (far too many) iterations to get this code to produce the expected output.
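
One way to make the conditions easier to reason about is to buffer the rows of the current run in an array and flush them from a single place. The following is only a sketch of that idea, not the code above: the name group_indels and the closure $flush are made up for illustration, and it assumes every data line has at least five whitespace-separated fields, as in the sample input.

```perl
use strict;
use warnings;

# Hypothetical helper: takes input lines, returns the grouped output lines.
sub group_indels {
    my @lines = @_;
    my @out;
    my @run;    # rows (array refs) of the current run of consecutive "-" positions

    # Emit the pending run, if any, as one row and empty the buffer.
    my $flush = sub {
        return unless @run;
        my $letters = join '', map { $_->[2] } @run;
        push @out, sprintf("seq   %-5d  %-4s  %s  %s",
            $run[0][1], $letters, '-', $run[-1][4]);
        @run = ();
    };

    for my $line (@lines) {
        my @els = split ' ', $line;
        next unless @els >= 5;    # assumption: skip blank or short lines
        if ($els[3] eq '-') {
            # A gap in the positions ends the current run.
            $flush->() if @run && $els[1] != $run[-1][1] + 1;
            push @run, \@els;
        }
        else {
            $flush->();
            push @out, sprintf("seq   %-5d  %-4s  %s  %s", @els[1 .. 4]);
        }
    }
    $flush->();    # do not lose a run that ends the input
    return @out;
}

# Small demonstration on a few rows taken from the sample input.
my @demo = (
    'seq   3382    A   R   +',
    'seq   4923    C   -   + *',
    'seq   4924    C   -   + *',
    'seq   4925    T   -   + *',
    'seq   5252    A   W   +',
    'seq   16710   C   -   - #',
);
print "$_\n" for group_indels(@demo);
```

With this layout there are only two questions to answer per line: is it a "-" row, and does it continue the buffered run? The final flush after the loop covers the case where the input ends in the middle of a group. As in the code above, the trailing * and # markers are dropped.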
