
Pattern should match a hyphen too

I have a piece of Perl code (pattern matching) like this,

$var = "<AT>this is an at command</AT>";

if ($var =~ /<AT>([\s\w]*)<\/AT>/i)
{
    print "Matched in AT command\n";
    print "$var\n\n";
}

It works fine as long as the content between the tags has no hyphen. It stops matching when a hyphen appears in the string between the tags, like this: <AT>this is an at-command</AT>.

Can anyone fix this regex so that it also matches when a hyphen is present?

Please help.

Senthil


On character class

Your pattern contains this subpattern:

[\s\w]*

The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character that is not a lowercase vowel.

\s is the shorthand for the whitespace character class; \w is the shorthand for the word character class. Neither contains the hyphen.

The * is the zero-or-more repetition specifier.

Now you should understand why this pattern does not match a hyphen: it matches zero or more characters, each of which is either a whitespace or a word character. If you want to match a hyphen, you can add it to the character class.

[\s\w-]*

If you also want to include the period, question mark, and exclamation mark, for example, then you can simply add them in as well:

[\s\w.!?-]*
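As a quick check of the three character classes above, here is a minimal sketch. The helper name narrowest_class and the third sample string are made up for illustration; only the patterns come from the question and this answer.

```perl
use strict;
use warnings;

# Hypothetical helper: return the narrowest of the three character
# classes whose pattern matches $text, or 'none' if none of them does.
sub narrowest_class {
    my ($text) = @_;
    return '[\s\w]'     if $text =~ m{<AT>[\s\w]*</AT>}i;      # original class
    return '[\s\w-]'    if $text =~ m{<AT>[\s\w-]*</AT>}i;     # plus hyphen
    return '[\s\w.!?-]' if $text =~ m{<AT>[\s\w.!?-]*</AT>}i;  # plus . ! ?
    return 'none';
}

for my $text (
    '<AT>this is an at command</AT>',
    '<AT>this is an at-command</AT>',
    '<AT>is this an at-command?</AT>',
) {
    printf "%-33s => %s\n", $text, narrowest_class($text);
}
```

The hyphenated string should only be accepted once the hyphen is part of the class, and the string ending in a question mark only by the widest class.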

Special note on hyphen

BE CAUTIOUS when including the hyphen in a character class. Inside a character class it is a regex metacharacter that defines a character range. For example,

[a-z]

matches any one character in the range from 'a' to 'z', inclusive. By contrast,

[az-]

matches one of exactly 3 characters: 'a', 'z', and '-'. When you put - as the last element in a character class, it becomes a literal hyphen instead of a range definition. You can also put it as the first element, or escape it (by preceding it with a backslash, which is how you escape all other regex metacharacters too).

That is, the following 3 character classes are identical:

[az-]         [-az]         [a\-z]
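To see the difference between a literal hyphen and a range, a small sketch like the following can help. The helper matched_chars and the sample text are assumptions made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return, joined into one string, every character
# of $text that the character class $class matches.
sub matched_chars {
    my ($class, $text) = @_;
    return join '', $text =~ /($class)/g;
}

my $text = 'a-z b-y';
for my $class ('[az-]', '[-az]', '[a\-z]', '[a-z]') {
    printf "%-7s => %s\n", $class, matched_chars($class, $text);
}
```

The first three classes should pick out the same characters (the a, the z, and both hyphens), while [a-z] picks out the four lowercase letters and no hyphen.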

Related questions

  • Regex: why doesn't [01-12] range work as expected?


You can just add a hyphen in the char class as:

if ($var =~ /<AT>([\s\w-]*)<\/AT>/i)

Also, since your regex contains a /, you can use a different delimiter so that you do not have to escape it:

if ($var =~ m{<AT>([\s\w-]*)</AT>}i)


Use \S instead of \w.

if ($var =~ /<AT>([\s\S]*)<\/AT>/i) {
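A minimal sketch of what [\s\S]* accepts, using a made-up helper any_content. Since every character is either whitespace or non-whitespace, the class matches anything, including the hyphen and a closing tag in the middle of the string.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first capture of the [\s\S]* pattern,
# or undef when the pattern does not match.
sub any_content {
    my ($text) = @_;
    return $text =~ /<AT>([\s\S]*)<\/AT>/i ? $1 : undef;
}

print any_content('<AT>this is an at-command</AT>'), "\n";
print any_content('<AT>one</AT><AT>two</AT>'), "\n";
```

With two tag pairs in one string, the greedy * runs to the last </AT>, so the capture spans both pairs. That is fine for a single tag pair but worth knowing about.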


If you want everything between <AT> and </AT>, you can use

if ($var =~ /<AT>((?:(?!<AT>).)*)<\/AT>/i)

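The negative lookahead keeps the match from running across another opening <AT>, so with several tag pairs in one string each match stays within its own pair, much like an ungreedy match.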
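A small sketch to check how this pattern behaves with more than one tag pair. The helper at_contents and the sample strings are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every capture of the lookahead-guarded
# pattern found in $text (call it in list context).
sub at_contents {
    my ($text) = @_;
    return $text =~ m{<AT>((?:(?!<AT>).)*)</AT>}gi;
}

# Two tag pairs: the lookahead stops each match before the next <AT>.
print join(' | ', at_contents('<AT>one-two</AT><AT>three</AT>')), "\n";

# One opening tag, two closing tags: the * itself still takes as much
# as it can, so the capture runs up to the last </AT>.
print join(' | ', at_contents('<AT>a</AT> b</AT>')), "\n";
```

Note that . does not match a newline unless the /s modifier is added, so this sketch assumes the content sits on a single line.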


You need to add more characters to your class like [\s\w-]* (as codaddict told you).

Moreover, you may want to use a lookahead to match the end of your command ("match this only if it is followed by the closing tag"), like this:

if ($var =~ /<AT>([^<]*)(?=<\/AT>)/i)

[^<] stands for "any character (including the hyphen) except <".

You could even add a lookbehind:

if ($var =~ /(?<=<AT>)([^<]*)(?=<\/AT>)/i)
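A minimal sketch comparing the two lookaround variants, with the opening / delimiter written before the lookbehind. The helper names with_lookahead and with_lookaround are made up for this example. Lookarounds do not consume text, so they change what the whole match ($&) contains but not the capture ($1).

```perl
use strict;
use warnings;

# Hypothetical helper: (whole match, capture) for the lookahead-only pattern.
sub with_lookahead {
    my ($text) = @_;
    return $text =~ /<AT>([^<]*)(?=<\/AT>)/i ? ($&, $1) : ();
}

# Hypothetical helper: (whole match, capture) for lookbehind plus lookahead.
sub with_lookaround {
    my ($text) = @_;
    return $text =~ /(?<=<AT>)([^<]*)(?=<\/AT>)/i ? ($&, $1) : ();
}

my $var = '<AT>this is an at-command</AT>';
printf "lookahead:  match='%s' capture='%s'\n", with_lookahead($var);
printf "lookaround: match='%s' capture='%s'\n", with_lookaround($var);
```

If only $1 is used afterwards, both variants give the same result as the plain pattern with a literal closing tag; the lookarounds matter mainly when the whole match is used, for example in a substitution.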

For more complex things (since you seem to want a small parser), you should look at grammar theory and at lex/yacc.
