Retaining captures with the Perl substitution operator
Can someone explain why the following code...
#!/opt/local/bin/perl
use strict;
use warnings;
my $string;
$string = "\t\t\tEntry";
print "String: >$string<\n";
$string =~ s/^(\t*)//gi;
print "\$1: >$1<\n";
print "String: >$string<\n";
print "\n";
$string = "\t\t\tEntry";
$string =~ s/^(\t*)([^\t]+)/$2/gi;
print "\$1: >$1<\n";
print "String: >$string<\n";
print "\n";
exit 0;
...produces the following output...
String: > Entry<
Use of uninitialized value in concatenation (.) or string at ~/sandbox.pl line 12.
$1: ><
String: >Entry<
$1: > <
String: >Entry<
...or more directly: Why is the matched value in the first substitution not retained in $1?
I tried this on two builds of Perl 5.12 and did not encounter the problem. Perl 5.8 did show it.
Because you have the g option, Perl keeps trying to match the pattern until it fails. See the debug output below.
So it doesn't work in Perl 5.8, but this does:
my $c1;
$string =~ s/^(\t*)/$c1=$1;''/ge;
Each time the pattern matches, the replacement expression saves the capture to $c1.
This is what use re 'debug' tells me:
Compiling REx `^(\t*)'
size 9 Got 76 bytes for offset annotations.
first at 2
1: BOL(2)
2: OPEN1(4)
4: STAR(7)
5: EXACT <\t>(0)
7: CLOSE1(9)
9: END(0)
anchored(BOL) minlen 0
Offsets: [9]
1[1] 2[1] 0[0] 5[1] 3[1] 0[0] 6[1] 0[0] 7[0]
Compiling REx `^(\t*)([^\t]+)'
size 25 Got 204 bytes for offset annotations.
first at 2
1: BOL(2)
2: OPEN1(4)
4: STAR(7)
5: EXACTF <\t>(0)
7: CLOSE1(9)
9: OPEN2(11)
11: PLUS(23)
12: ANYOF[\0-\10\12-\377{unicode_all}](0)
23: CLOSE2(25)
25: END(0)
anchored(BOL) minlen 1
Offsets: [25]
1[1] 2[1] 0[0] 5[1] 3[1] 0[0] 6[1] 0[0] 7[1] 0[0] 13[1] 8[5] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 14[1] 0[0] 15[0]
String: > Entry<
Matching REx `^(\t*)' against ` Entry'
Setting an EVAL scope, savestack=5
0 <> < Entry> | 1: BOL
0 <> < Entry> | 2: OPEN1
0 <> < Entry> | 4: STAR
EXACT <\t> can match 3 times out of 2147483647...
Setting an EVAL scope, savestack=5
3 < > <Entry> | 7: CLOSE1
3 < > <Entry> | 9: END
Match successful!
match pos=0
Use of uninitialized value in substitution iterator at - line 11.
Matching REx `^(\t*)' against `Entry'
Setting an EVAL scope, savestack=5
3 < > <Entry> | 1: BOL
failed...
Match failed
Freeing REx: `"^(\\t*)"'
Freeing REx: `"^(\\t*)([^\\t]+)"'
Because you are trying to match whitespace at the beginning of the line, you need neither the g nor the i modifier. So this might be a case where you are really trying to do something else.
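A minimal sketch of that suggestion, assuming the goal is simply to strip the leading tabs and keep a copy of them. Without the g modifier the pattern is attempted only once, so there is no later failed attempt that could disturb $1. The variable name $indent is illustrative and not from the original post.

```perl
use strict;
use warnings;

my $string = "\t\t\tEntry";
my $indent = '';

# No /g: the pattern is tried exactly once, so no further
# attempt follows the successful match.
if ( $string =~ s/^(\t*)// ) {
    $indent = $1;    # copy right away; a later match may change $1
}

printf "tabs removed: %d\n", length $indent;
print  "String: >$string<\n";
```

Copying $1 into a lexical immediately after the substitution is the part that matters; the if guard just makes the dependency on a successful match explicit.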
I think that in version 5.10 and later, the capture buffers are only affected if there was a match.
The interesting thing in your example is that $string =~ s/^(\t*)([^\t]+)/$2/gi; did not reset the capture buffers. That appears to be because of a preamble that estimates whether the match should be tried at all. In this case, ([^\t]+) consumed the entire string in the first match, so a "String too short" condition occurred on the next pass and the buffers were never reset.
I can't test it, but $string =~ s/^(\t*)([^\t])//gi should give the same warning.
After if ( s///g ) {}, the capture buffers are not certain to contain anything when you test them. This was the case in version 5.8. Even in later versions it's really just a debug feature.
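One way to avoid depending on $1 after s///g at all is to capture with a plain match in list context first and do the substitution as a separate step. This is only a sketch of that idea; $indent is an illustrative name.

```perl
use strict;
use warnings;

my $string = "\t\t\tEntry";

# In list context a successful match returns its captures directly,
# so the value never has to be read back out of $1.
my ($indent) = $string =~ /^(\t*)/;

# Strip the leading tabs in a separate step.
$string =~ s/^\t*//;

printf "tabs captured: %d\n", length $indent;
print  "String: >$string<\n";
```

Because \t* can match zero tabs, the first match always succeeds here and $indent is always defined, possibly as an empty string.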
Edit @theracoon - on your comment: "I'm reasonably certain that ([^\t]+) did not actually consume the entire string. The output definitely does not reflect that."
This shows that your regex consumed the entire string on the first match (Pass 1).
There is nothing left to match on the second pass. That is the way the /g modifier works:
it tries to match the entire regex again, at the position in the string where the last match left off.
use re 'debug';
$string = "\t\t\tEntry";
$string =~ s/^(\t*)([^\t]+)/$2/gi;
print "'$string'\n";
Pass 1 ..
Matching REx "^(\t*)([^\t]+)"
against "%t%t%tEntry"
8 <%t%t%tEntry
> <>
Match successful!
Pass 2 ..
Matching REx "^(\t*)([^\t]+)"
against ""
(Nope, nothing left to match)
String too short [regexec_flags]...
Match failed
'Entry'
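To see the "where the last match left off" behaviour in isolation, here is a small sketch using a match in scalar context with /g, where pos() reports the offset at which the next attempt will start. The sample text is made up, and s///g does its looping internally rather than through pos(), so treat this as an analogy rather than a trace of the substitution itself.

```perl
use strict;
use warnings;

my $text = "ab cd";
my @seen;

# In scalar context, each /g match resumes at pos($text),
# the offset just past the previous match.
while ( $text =~ /(\w+)/g ) {
    push @seen, "$1 ends at " . pos($text);
}

print "$_\n" for @seen;
```

The loop stops when an attempt finds nothing more to match, which is the same kind of final failed pass that the debug output above reports as "Match failed".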