Find all possible starting positions of a regular expression match in perl, including overlapping matches?
Is there a way to find all possible start positions for a regex match in perl?
开发者_开发问答For example, if your regex was "aa" and the text was "aaaa", it would return 0, 1, and 2, instead of, say 0 and 2.
Obviously, you could just do something like return the first match, and then delete all characters up to and including that starting character, and perform another search, but I'm hoping for something more efficient.
Use lookahead:
$ perl -le 'print $-[0] while "aaaa" =~ /a(?=a)/g'
In general, put everything except the first character of the regex inside of the (?=...)
.
Update:
I thought about this one a bit more, and came up with this solution using an embedded code block, which is nearly three times faster than the grep
solution:
use 5.010;
use warnings;
use strict;
{my @pos;
my $push_pos = qr/(?{push @pos, $-[0]})/;
sub with_code {
my ($re, $str) = @_;
@pos = ();
$str =~ /(?:$re)$push_pos(?!)/;
@pos
}}
and for comparison:
sub with_grep { # old solution
my ($re, $str) = @_;
grep {pos($str) = $_; $str =~ /\G(?:$re)/} 0 .. length($str) - 1;
}
sub with_while { # per Michael Carman's solution, corrected
my ($re, $str) = @_;
my @pos;
while ($str =~ /\G.*?($re)/) {
push @pos, $-[1];
pos $str = $-[1] + 1
}
@pos
}
sub with_look_ahead { # a fragile "generic" version of Sean's solution
my ($re, $str) = @_;
my ($re_a, $re_b) = split //, $re, 2;
my @pos;
push @pos, $-[0] while $str =~ /$re_a(?=$re_b)/g;
@pos
}
Benchmarked and sanity checked with:
use Benchmark 'cmpthese';
my @arg = qw(aa aaaabbbbbbbaaabbbbbaaa);
my $expect = 7;
for my $sub qw(grep while code look_ahead) {
no strict 'refs';
my @got = &{"with_$sub"}(@arg);
"@got" eq '0 1 2 11 12 19 20' or die "$sub: @got";
}
cmpthese -2 => {
grep => sub {with_grep (@arg) == $expect or die},
while => sub {with_while (@arg) == $expect or die},
code => sub {with_code (@arg) == $expect or die},
ahead => sub {with_look_ahead(@arg) == $expect or die},
};
Which prints:
Rate grep while ahead code
grep 49337/s -- -20% -43% -65%
while 61293/s 24% -- -29% -56%
ahead 86340/s 75% 41% -- -38%
code 139161/s 182% 127% 61% --
I know you asked for a regex, but there is actually a simple builtin function that does something quite similar, the function index
(perldoc -f index
). From that we can build up a simple solution to your direct question, though if you really need a more complicated search than your example this will not work as it only looks for substrings (after an index given by the third parameter).
#!/usr/bin/env perl
use strict;
use warnings;
my $str = 'aaaa';
my $substr = 'aa';
my $pos = -1;
while (1) {
$pos = index($str, $substr, $pos + 1);
last if $pos < 0;
print $pos . "\n";
}
You can use global matching with the pos()
function:
my $s1 = "aaaa";
my $s2 = "aa";
while ($s1 =~ /aa/g) {
print pos($s1) - length($s2), "\n";
}
精彩评论