
How can I exclude the part of the string that matches a Perl regular expression?

I have a file that contains different types of lines. I want to select only those lines that have a user-agent. I know that such a line looks something like this:

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de-DE; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16

So, I want to identify the line that starts with the string "User-Agent", but after that I want to process the rest of the line excluding this string. My question is: does Perl store the remaining string in any special variable that I can use for further processing? Basically, I want to match the line that starts with that string and then work on the rest of it, excluding that string.

I search for that line with a simple regexp:

/^User-Agent:/


The substr solution:

my $start = "User-Agent: ";

if ($start eq substr $line, 0, length($start)) {
    my $remainder = substr $line, length($start);
}


# Neither "-" nor ":" needs escaping here; $1 holds the rest of the line.
if ($line =~ /^User-Agent: (.*?)$/) {
    process_string($1);
}
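
As a rough sketch of how the capture-group approach could be applied to a whole set of lines: the helper name user_agent_of and the sample lines below are made up for illustration, and the pattern uses \s* after the colon, which is an assumption about how much whitespace to tolerate.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text after "User-Agent:",
# or undef when the line is not a User-Agent line.
sub user_agent_of {
    my ($line) = @_;
    return $line =~ /^User-Agent:\s*(.*)$/ ? $1 : undef;
}

# Sample input, invented for this example.
my @lines = (
    'Host: example.com',
    'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de-DE; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16',
    'Accept: */*',
);

for my $line (@lines) {
    my $agent = user_agent_of($line);
    print "$agent\n" if defined $agent;    # only the User-Agent line prints
}
```

Returning undef for non-matching lines lets the caller do the selection and the extraction in one step.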


(my $remainder = $str) =~ s/^User-Agent: //;


You could use the $' variable, but don't: it adds a lot of overhead. For the same purpose, the @+ variable (or, in English, @LAST_MATCH_END) is probably just about as good.

So this will get you there:

use English qw<@LAST_MATCH_END>;

my $value = substr( $line, $LAST_MATCH_END[0] );
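
For completeness, here is a minimal self-contained sketch of the same idea with the match included. It uses the punctuation form $+[0] rather than the English alias, and the helper name after_prefix is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns what follows the prefix, or undef on no match.
sub after_prefix {
    my ($line) = @_;
    return undef unless $line =~ /^User-Agent: /;

    # $+[0] is the offset just past the end of the whole match,
    # so substr from there yields the rest of the line.
    return substr( $line, $+[0] );
}

print after_prefix('User-Agent: Firefox/2.0.0.16'), "\n";    # prints "Firefox/2.0.0.16"
```

Note that @+ is only meaningful right after a successful match, which is why the substr call sits behind the match check.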


Perl 5.10 has a nice feature that allows you to get the simplicity of the $' solutions without the performance problems. You use the /p flag and the ${^POSTMATCH} variable:

 use 5.010;
 if( $string =~ m/^User-Agent:\s+/ip ) {
      my $agent = ${^POSTMATCH};
      say $agent;
      }

There are some other tricks though. If you can't use Perl 5.10 or later, you can use a global match in scalar context; the value of pos is then where the match left off in the string. You can use that position in substr:

 if( $string =~ m/^User-Agent:\s+/ig ) {
      my $agent = substr $string, pos( $string );
      print $agent, "\n";
      }
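
One thing to keep in mind with the pos approach, shown as a small sketch with made-up variable names: the position stays attached to the string after a successful //g match in scalar context, so a later //g match on the same string resumes from there instead of from the start.

```perl
use strict;
use warnings;

my $string = 'User-Agent: Firefox';
my $agent;

if ( $string =~ m/^User-Agent:\s+/ig ) {
    # pos() is the offset where the successful //g match ended.
    $agent = substr $string, pos($string);
}
my $end = pos($string);

# A second //g match resumes at offset 12, where ^ cannot match,
# so it fails (and that failure resets pos).
my $again = $string =~ m/^User-Agent:/g ? 'matched' : 'no match';

print "$agent\n";    # Firefox
print "$end\n";      # 12
print "$again\n";    # no match
```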

The pos is similar to the @+ trick that Axeman shows. I think I have some examples with @+ and @- in Mastering Perl in the first chapter.

Perl 5.14 adds another interesting way to do this. The /r flag on s/// does a non-destructive substitution. That is, it matches the bound string but performs the substitution on a copy and returns the copy:

use 5.014;  # the /r flag needs Perl 5.14 or later
my $string = 'User-Agent: Firefox';
my $agent = $string =~ s/^User-Agent:\s+//r;
say $agent;

I thought that /r was silly at first, but I'm really starting to love it. So many things turn out to be really easy with it. This is similar to the idiom that M42 shows, but it's a bit tricky because the old idiom does an assignment then a substitution, where the /r feature does a substitution then an assignment. You have to be careful with your parentheses there to ensure the right order happens.

Note in this case that since the version is Perl 5.12 or later, you automatically get strictures.
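
To make the ordering difference concrete, here is a small side-by-side sketch. The variable names $copy, $other and $count are only for illustration, and it assumes Perl 5.14 or later for /r.

```perl
use 5.014;
use warnings;

my $string = 'User-Agent: Firefox';

# Old idiom: the parentheses force the assignment first,
# then s/// edits the copy in place.
( my $copy = $string ) =~ s/^User-Agent:\s+//;

# /r: s/// works on a copy and returns it, then = assigns the result.
my $agent = $string =~ s/^User-Agent:\s+//r;

# Without /r or the parentheses, s/// edits $other in place and returns
# the number of substitutions, so $count is 1, not the agent string.
my $other = $string;
my $count = $other =~ s/^User-Agent:\s+//;

say $copy;      # Firefox
say $agent;     # Firefox
say $count;     # 1
say $string;    # User-Agent: Firefox (unchanged)
```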


You can use $' to capture the post-match part of the string:

if ( $line =~ m/^User-Agent: / ) {
    warn $';
}

(Note that there's a trailing space after the colon there.)

But note, from perlre:

WARNING: Once Perl sees that you need one of $& , $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression (?: ... ) instead.) But if you never use $& , $` or $' , then patterns without capturing parentheses will not be penalized. So avoid $& , $' , and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two.


Use $' to get the part of the string to the right of the match.

There is much wailing and gnashing of teeth in the other answers about the "considerable performance penalty" but unless you actually know that your program is rich in use of regular expressions, and that you have a performance problem, I wouldn't worry about it.

We worry too often about optimizations that have little-to-no impact on the actual code. Chances are, this is one of them, too.
