
Why doesn't my Perl regex match what I think it should?

I tried the following code snippet from Robert's Perl tutorial:

> $_='My email address is <webslave@work.com>.';
> 
> print "Found it ! :$1:" if /(<*>)/i;

When I ran it, the output was:

Found it ! :>:

However, shouldn't the output be:

Found it ! :m>:

since 'm' matches "0 or more '<'", i.e. the '<*' part of the regex?

Also,

$_='My email address is <webslave@work.com>.';
print "Match 1 worked :$1:" if /(<*)/i;

When this is run the output is:

Match 1 worked ::

$_='<My email address is <webslave@work.com>.';
print "Match 2 worked :$1:" if /(<*)/i;

When the above is run, the output is:

Match 2 worked :<:

But shouldn't the output be:

Match 2 worked ::

since the first match (i.e. $1) should be "" rather than "<", as in the example before it?


if /(<*>)/i;

will match 0 or more < chars, followed immediately by a > char...

so the only possible match is the > char, which is preceded by 0 < chars.
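To see this for yourself, here is a small sketch that runs the pattern against the string from the question and also reports where the capture starts. The variable names $text, $found and $start are my own, not from the tutorial.

```perl
use strict;
use warnings;

my $text = 'My email address is <webslave@work.com>.';

# <*> means: zero or more '<', immediately followed by one '>'.
# The only '>' in the string comes right after 'm', not '<',
# so the <* part matches zero characters.
my ($found) = $text =~ /(<*>)/;
my $start   = $-[1];    # offset where capture group 1 begins

printf "Found it ! :%s: at offset %d\n", $found, $start;
```

The capture is just the '>' and it starts exactly where the '>' sits in the string, so the 'm' before it is never part of the match.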


The answer to your first question is simple: 'm' is not a '<', so <* cannot match it. It matches zero '<' characters there instead.

The second question is rather interesting. To understand it, you need to know two facts:

  1. The regex engine tries each starting position from left to right. As soon as the pattern matches at some position, it stops and returns that result.
  2. The standard quantifiers (*, +, ? and {min,max}) are greedy, which means /<*/ will match as many consecutive < characters as it can.
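Both facts can be checked with a quick sketch. The two test strings below are made up for illustration and are not from the question.

```perl
use strict;
use warnings;

# Greedy: at the position where it starts, <* takes every '<' available.
my ($greedy) = '<<<abc' =~ /(<*)/;

# Leftmost: the engine tries offset 0 first, and <* is satisfied by zero
# characters there, so the later run of '<' is never reached.
my ($leftmost) = 'abc<<<' =~ /(<*)/;

printf "greedy=:%s: leftmost=:%s:\n", $greedy, $leftmost;
```

The first capture holds all three '<' characters, while the second is empty even though the string contains '<' further along.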

So, back to the regex /<*/. When matching

My email address is <webslave@work.com>.

At the very beginning of the string, ^, the regex matches an empty string: <* takes as many '<' as it can, and since the next character is M, that is zero of them. This is a successful match, so voila, Perl stops there and gives you the empty result.

Now for the second string:

<My email address is <webslave@work.com>.

At the very beginning of the string, ^, the regex again matches an empty string. But this time the next character, <, also matches the regex, and the * quantifier is greedy, so it matches as much as possible. The result is therefore <.
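If you want to watch this happen, a sketch along these lines prints where each match starts and ends, using Perl's built-in @- and @+ arrays. The @results array is just a helper name I picked.

```perl
use strict;
use warnings;

my @results;
for my $text ('My email address is <webslave@work.com>.',
              '<My email address is <webslave@work.com>.') {
    if ($text =~ /(<*)/) {
        # $-[0] and $+[0] are the start and end offsets of the whole match.
        push @results, [ $-[0], $+[0], $1 ];
    }
}

printf "start=%d end=%d captured=:%s:\n", @$_ for @results;
```

Both matches start at offset 0. The first one also ends at 0 (an empty match), while the second ends at 1 because the leading '<' was consumed.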


With $1 you access the first "capture" of the regex, a capture being whatever is matched between parentheses. In your example I think you're missing a '.' character. <*> matches zero or more '<' characters followed by a '>' character, so here it matches zero '<' and one '>'. It probably should read like this:

print "Found it ! :$1:" if /(<.*>)/i;

Now this matches a '<' followed by zero or more arbitrary characters ('.' matches any character except a newline by default), followed by '>'.
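As a quick check, here is the corrected pattern run against the string from the question. The variable names are mine.

```perl
use strict;
use warnings;

my $text = 'My email address is <webslave@work.com>.';

# '<', then any run of characters, then '>'.
my ($address) = $text =~ /(<.*>)/;

printf "Found it ! :%s:\n", $address;
```

This time the capture is the whole bracketed address, angle brackets included.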


Regular expressions in Perl work a bit differently than wildcards in many OS applications.

The * means "0 or more of the previous thing". So when you do

<*>

it means

"Zero or more less than characters, followed by a greater than character."

What you want is the regular expression user's best friend, the dot: .

<.*>

That means

"a less than character, followed by ANYTHING 0 or more times, followed by a greater than character."

But that's probably not what you mean either: the > character is also "any character"! Fortunately, there's an easy way of saying what you really mean: you make * no longer greedy with the ? character:

<.*?>

This means, "The less than character, followed by anything, 0 or more times, UNTIL I reach a > character."
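The difference only shows up when there is more than one > in the string, so here is a sketch with a made-up sentence containing two bracketed addresses (the string and variable names are my own, not from the question).

```perl
use strict;
use warnings;

# Hypothetical input with two bracketed addresses.
my $text = 'Write to <a@example.com> or <b@example.com>.';

my ($greedy) = $text =~ /(<.*>)/;     # .* runs on to the last '>'
my ($lazy)   = $text =~ /(<.*?>)/;    # .*? stops at the first '>'

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy version swallows everything from the first '<' to the last '>', while the non-greedy version captures only the first address.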

Woo!

There are a few great websites out there that will get you familiar with the great world of regexes, and one of my favorites is regular-expressions.info. For Perl-specific regexes, though, you can't beat the classic Perl Regular Expressions Tutorial. It has guided many a regex wanderer to the Perl homeland, and is a great resource.


Personally I'm very fond of the cheat sheet at Added Bytes.
