开发者

Finding Base Overlap Counts and Internal Gap in Two Strings

I have this two strings of equal length, which I need开发者_如何学Python to compare. I want to find overlap base(.) and internal gap (*). Below is the example:

------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC
-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---
      ................**.................

Number of overlap = 33. Number of internal gap = 2.

I have no problem finding the number of overlap. But I have problem finding internal gap. Below is the current code I have. It is horribly slow. In principle I need to compute millions of such pairs.

#!/usr/bin/perl -w
my $s1 = "------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC";
my $s2 = "-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---";

print "$s1\n";
print "$s2\n";


my %base = ("A" => 1, "T" => 1, "C" => 1, "G" => 1);

my $ovlp_basecount = 0;
my $internal_gap = 0;

foreach my $si ( 0 .. length($s1)  ) {


    my $base1 = substr($s1,$si,1);
    my $base2 = substr($s2,$si,1);


    # Overlap
    if ( $base{$base1} && $base{$base2} ) {
        $ovlp_basecount++;
    }

    # Not sure how to compute internal gap

}


print "TOTAL OVERLAP BASE = $ovlp_basecount\n";
print "TOTAL Internal Gap \?\n";

Please advice how can I find internal gap and overlap efficiently.


You can use a bitwise OR on the strings to find the the areas in one string that overlap blank areas in the other. This process also has the effect of revealing the overlap by converting non-overlapping characters to lower case, thus making finding the overlap quite simple too:

#!/usr/bin/perl

use strict;
use warnings;

my $s1 = "------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC";
my $s2 = "-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---";

$s1 =~ tr/-/\x20/;
$s2 =~ tr/-/\x20/;
my $or = $s1 | $s2;
(my $gap) = $or =~ m/^.*[ACTG]([actg]+)[ACTG].*$/;
(my $overlap = $or) =~ s/[^A-Z]//g;

print "s1:      '$s1'\n";
print "s2:      '$s2'\n";
print "OR:      '$or'\n";
printf "Gap:     '%s' (%d)\n", $gap,     length $gap;
printf "Overlap  '%s' (%d)\n", $overlap, length $overlap;

Prints:

s1:      '      ACTAAAAATACAAAAA  TTAGCCAGGCGTGGTGGCAC'
s2:      '     TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG   '
OR:      '     tACTAAAAATACAAAAAaaTTAGCCAGGWGTGGTGGcac'
Gap:     'aa' (2)
Overlap  'ACTAAAAATACAAAAATTAGCCAGGWGTGGTGG' (33)

For more information on string bitwise operations:

http://teaching.idallen.com/cst8214/08w/notes/bit_operations.txt


Assuming the gaps never overlap, you can solve this using regular expressions. Here's an answer for your s1.

echo '------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC' | perl -ne '$s = 0; foreach(/[GTAC](-+)[GTAC]/) { $s += length($1); } print "$s\n";'
2
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜