开发者

Efficient substring matching in perl

I am looking for 开发者_如何学Pythonan efficient solution to do find the longest possible substring in a string tolerating n mismatches in the main string

Eg: Main String

  1. AGACGTACTACTCTACTAGATGCA*TACTCTAC*
  2. AGACGTACTACTCTACTAGATGCA*TACTCTAC*
  3. AGACGTACTACTCTACAAGATGCA*TACTCTAC*
  4. AGACGTACTACTTTACAAGATGCA*TACTCTAC*

Search String:

  1. TACTCTACT : this should be considered a match to all of the above main strings.

Also I there could be the case where part of the substring is at the end of main string and I would like to pick that up also.

I would appreciate if you could give some pointers.

PS: I will have one search string and about 100 million main strings to search the substring for.

Thanks! -Abhi


I'm not certain that this is what you're after but BioPerl has an approximate-grep tool called Bio::Grep::Backend::Agrep:

Bio::Grep::Backend::Agrep searches for a query with agrep

And agrep is "approximate grep". AFAIK, this builds a database and then uses that database to make your searches faster so this sounds like what you're after.

Looks like you're working with DNA sequences so you probably have BioPerl already installed.

You could also try using String::Approx directly:

Perl extension for approximate matching (fuzzy matching)

I suspect that Bio::Grep::Backend::Agrep will be faster and a better match for your task though.


use strict;
use warnings;
use feature qw( say );

sub match {
   my ($s, $t, $max_x) = @_;

   my $m = my @s = unpack('(a)*', $s);
   my $n = my @t = unpack('(a)*', $t);

   my @length_at_k     = ( 0 ) x ($m+$n);
   my @mismatches_at_k = ( 0 ) x ($m+$n);
   my $offset = $m;

   my $best_length = 0;
   my @solutions;
   for my $i (0..$m-1) {
      --$offset;

      for my $j (0..$n-1) {
         my $k = $j + $offset;

         if ($s[$i] eq $t[$j]) {
            ++$length_at_k[$k];
         }
         elsif ($length_at_k[$k] > 0 && $mismatches_at_k[$k] < $max_x) {
            ++$length_at_k[$k];
            ++$mismatches_at_k[$k];
         }
         else {
            $length_at_k[$k] = 0;
            $mismatches_at_k[$k] = 0;
         }

         my $length = $length_at_k[$k] + $max_x - $mismatches_at_k[$k];
         $length = $i+1 if $length > $i+1;

         if ($length >= $best_length) {
            if ($length > $best_length) {
               $best_length = $length;
               @solutions = ();
            }

            push @solutions, $i-$length+1;
         }
      }
   }

   return map { substr($s, $_, $best_length) } @solutions;
}

say for match('AABBCC', 'DDBBEE', 2);

Output:

AABB
ABBC
BBCC
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜