How can I match end-of-line multiple times in a regex without interpolation?

If I have an input with newlines in it, like:

[INFO]
xyz
[INFO]

How can I pull out the xyz part using $ anchors? I tried a pattern like /^\[INFO\]$(.*?)$\[INFO\]/ms, but perl gives me:

Use of uninitialized value $\ in regexp compilation at scripts\t.pl line 6.

Is there a way to shut off interpolation so the anchors work as expected?

EDIT: The key is that the end-of-line anchor is a dollar sign, but at times it may be necessary to intersperse the end-of-line anchor through the pattern. If the pattern is interpolating then you might get problems such as uninitialized $\. For instance, an acceptable solution here is /^\[INFO\]\s*^(.*?)\s*^\[INFO\]/ms, but that does not solve the crux of the first problem. I've changed the anchors to be ^ so there is no interpolation going on, and with this input I'm free to do that. But what about when I really do want to reference EOL with $ in my pattern? How do I get the regex to compile?


The question is academic--there's no need for the $ anchors in your regex anyway. You should be using \n to match the newlines, because the $ only matches the gap between the linefeed and the character before it.

EDIT: What I'm trying to say is that you will never need to use $ that way. Any match that spans from one line to the next will have to consume the line separator somehow. Consider your example:

/^\[INFO\]$(.*?)$\[INFO\]/ms

If this did compile, the (.*?) would start out by consuming the first linefeed and keep going until it had matched \nxyz, where the second $ would succeed. But the next character is a linefeed, and the regex is looking for [, so that doesn't work. After backtracking, the (.*?) would reluctantly consume one more character--the second linefeed--but then the $ would fail.

Any time you try to match an EOL with $ and then some more stuff, the first "stuff" you'll have to match will be the linefeed, so why not match that instead? That's why the Perl regex compiler tries to interpret $\ as a variable name in your regex: it makes no sense to have an end-of-line anchor followed by a character that's not a line separator.
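As a minimal sketch of that advice (the variable names are only for illustration), the pattern from the question can be written with literal \n where the $ anchors were. With no dollar sign in the pattern there is nothing for Perl to interpolate:

```perl
use strict;
use warnings;

my $x = "[INFO]\nxyz\n[INFO]";

# Consume the line separators explicitly instead of anchoring with $
my ($body) = $x =~ /^\[INFO\]\n(.*?)\n\[INFO\]/ms;

print "Extracted <$body>\n" if defined $body;
```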


Based on the answer in perlfaq6, "How can I pull out lines between two patterns that are themselves on different lines?", here's what a one-liner would look like:

perl -0777 -ne 'print $1,"\n" while /\[INFO\]\s*(.*?)\s*\[INFO\]/sg' file.txt

The -0777 switch slurps in the whole file at once.
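For use inside a script rather than on the command line, roughly the same thing can be sketched by undefining $/ before reading. The in-memory filehandle below is only a stand-in for file.txt so the example runs on its own:

```perl
use strict;
use warnings;

# Stand-in for file.txt (contents assumed for illustration)
my $data = "[INFO]\nxyz\n[INFO]\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

# With $/ undefined, a single read returns the whole file, as -0777 does
my $text = do { local $/; <$fh> };
close $fh;

my @found;
while ( $text =~ /\[INFO\]\s*(.*?)\s*\[INFO\]/sg ) {
    push @found, $1;
}

print "$_\n" for @found;
```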

However, if you're after a subroutine that gives you the flexibility to choose what tag you want to extract, the File::Slurp module makes things a little easier:

use strict;
use warnings;
use File::Slurp qw/slurp/;

sub extract {

    my ( $tag, $fileName ) = @_;
    my $text = slurp $fileName;

    my ($info) = $text =~ /$tag\s*(.*?)\s*$tag/s;    # first block only, so /g is not needed
    return $info;
}

# Usage:
print extract( qr/\[INFO\]/, 'file.txt' ), "\n";


When regexes get too tricky, they probably are the wrong tool. I might consider using the flip flop operator here. It's false until its lefthand side is true, then stays true until its righthand side is true. That way, you can choose where to start and end the extraction just by looking at individual lines:

my $string = <<'HERE';
[INFO]
xyz
[INFO]
HERE

open my $string_fh, '<', \$string;

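while( <$string_fh> )
    {
    # Three dots: the right side is not tested on the line that matched the left side
    my $in_range = /\[INFO]/ ... /\[INFO]/;
    next unless $in_range;                            # outside the markers
    next if $in_range == 1 or $in_range =~ /E0\z/;    # the marker lines themselves
    chomp;

    print "Extracted <$_>\n";
    }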

If you are using Perl 5.10 or later, you can use the generalized line ending \R in a regex:

use 5.010;

my $string = <<'HERE';
[INFO]
xyz
[INFO]
HERE

my( $extracted ) = $string =~ /(?:\A|\R)\[INFO]\R(.*?)\R\[INFO]\R/;

print "Extracted <$extracted>\n";
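The point of \R is that it matches any line ending, not just \n. As a small sketch, here is the same pattern run against input that uses Windows-style \r\n line endings (the input string is made up for illustration):

```perl
use strict;
use warnings;
use 5.010;

# Same data as above, but with CRLF line endings
my $string = "[INFO]\r\nxyz\r\n[INFO]\r\n";

# \R matches \r\n as a single unit, so no \r ends up in the capture
my( $extracted ) = $string =~ /(?:\A|\R)\[INFO]\R(.*?)\R\[INFO]\R/;

print "Extracted <$extracted>\n";
```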

Don't get hung up on the end-of-line anchor.


Maybe the /x modifier can help:

m/ ^\[INFO\] $ # Match INFO line
   \n
   ^ (.*?) $ # Collect desired line
   \n 
   ^ \[INFO\] # Match another INFO line
/xms

I haven't tested that, so you'd probably have to debug it. But I think this will prevent the $ symbols from interpolating as variables.


Although I've accepted Alan Moore's answer (Ryan Thompson's answer would also have done the trick; too bad I could only accept one), I wanted to make the solution perfectly clear, as it was kind of buried in the comments and discussion. The following Perl script demonstrates that Perl treats the $ as the start of a variable to interpolate when it is followed by a character such as a backslash, and that turning off interpolation (by using single quotes as the delimiter, as in m'...') or separating the $ from what follows with whitespace under /x allows the $ to be treated as EOL.

use strict;
use warnings;

my $x = "[INFO]\nxyz\n[INFO]";
if( $x =~ /^\[INFO\]$\n(.*?)$\n\[INFO\]/m ) {
    print "'$1' FOUND\n";
} else {
    print "NO MATCH FOUND\n";
}

if( $x =~ m'^\[INFO\]$\n(.*?)$\n\[INFO\]'m ) {
    print "'$1' FOUND\n";
} else {
    print "NO MATCH FOUND\n";
}

if( $x =~ m/ ^\[INFO\] $ # Match INFO line
\n
^ (.*?) $ # Collect desired line
\n 
^ \[INFO\] # Match another INFO line
/xms ) {
    print "'$1' FOUND\n";
} else {
    print "NO MATCH FOUND\n";
}

The script produces the following output:

Use of uninitialized value $\ in regexp compilation at t.pl line 5.
Use of uninitialized value $\ in regexp compilation at t.pl line 5.
NO MATCH FOUND
'xyz' FOUND
'xyz' FOUND
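Another way to keep the $ as an anchor in an ordinary interpolating pattern, sketched here as an assumption rather than something from the answers above, is to wrap it in a non-capturing group. perlop lists $) among the variables that are not interpolated inside a pattern, so (?:$) is left alone:

```perl
use strict;
use warnings;

my $x = "[INFO]\nxyz\n[INFO]";

# (?:$) puts a closing parenthesis after the dollar sign instead of a backslash,
# so Perl does not read it as the $\ variable
my ($body) = $x =~ /^\[INFO\](?:$)\n(.*?)(?:$)\n\[INFO\]/m;

if ( defined $body ) {
    print "'$body' FOUND\n";
} else {
    print "NO MATCH FOUND\n";
}
```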