Efficient substring matching in perl
I am looking for 开发者_如何学Pythonan efficient solution to do find the longest possible substring in a string tolerating n mismatches in the main string
Eg: Main String
- AGACGTACTACTCTACTAGATGCA*TACTCTAC*
- AGACGTACTACTCTACTAGATGCA*TACTCTAC*
- AGACGTACTACTCTACAAGATGCA*TACTCTAC*
- AGACGTACTACTTTACAAGATGCA*TACTCTAC*
Search String:
- TACTCTACT : this should be considered a match to all of the above main strings.
Also I there could be the case where part of the substring is at the end of main string and I would like to pick that up also.
I would appreciate if you could give some pointers.
PS: I will have one search string and about 100 million main strings to search the substring for.
Thanks! -Abhi
I'm not certain that this is what you're after but BioPerl has an approximate-grep tool called Bio::Grep::Backend::Agrep
:
Bio::Grep::Backend::Agrep searches for a query with agrep
And agrep
is "approximate grep". AFAIK, this builds a database and then uses that database to make your searches faster so this sounds like what you're after.
Looks like you're working with DNA sequences so you probably have BioPerl already installed.
You could also try using String::Approx
directly:
Perl extension for approximate matching (fuzzy matching)
I suspect that Bio::Grep::Backend::Agrep
will be faster and a better match for your task though.
use strict;
use warnings;
use feature qw( say );
sub match {
my ($s, $t, $max_x) = @_;
my $m = my @s = unpack('(a)*', $s);
my $n = my @t = unpack('(a)*', $t);
my @length_at_k = ( 0 ) x ($m+$n);
my @mismatches_at_k = ( 0 ) x ($m+$n);
my $offset = $m;
my $best_length = 0;
my @solutions;
for my $i (0..$m-1) {
--$offset;
for my $j (0..$n-1) {
my $k = $j + $offset;
if ($s[$i] eq $t[$j]) {
++$length_at_k[$k];
}
elsif ($length_at_k[$k] > 0 && $mismatches_at_k[$k] < $max_x) {
++$length_at_k[$k];
++$mismatches_at_k[$k];
}
else {
$length_at_k[$k] = 0;
$mismatches_at_k[$k] = 0;
}
my $length = $length_at_k[$k] + $max_x - $mismatches_at_k[$k];
$length = $i+1 if $length > $i+1;
if ($length >= $best_length) {
if ($length > $best_length) {
$best_length = $length;
@solutions = ();
}
push @solutions, $i-$length+1;
}
}
}
return map { substr($s, $_, $best_length) } @solutions;
}
say for match('AABBCC', 'DDBBEE', 2);
Output:
AABB
ABBC
BBCC
精彩评论