What's the difference between * and ? in regular expressions?

Both seem to mean "match 0 or more characters". I don't understand the difference between them, or when to use ? and when to use *. Some examples would help.


  • In the formal definition, the regular expression operators are:

    . : concatenation. a.b.c matches the text abc. Concatenation is often written by simply placing two symbols back to back.

    * : matches the preceding symbol or group 0 or more times. (abc)* matches the empty string, abc, abcabc, abcabcabc, but not abcaabc. Known as the Kleene star.

    + : union; matches either the left-hand side or the right-hand side. (abc + def) matches abc or def. The | symbol is also used for this operator.

    These operators are applied to a set of symbols sigma (the alphabet), which contains the symbols of your language. Two other special symbols are epsilon, which denotes the empty string, and null, which denotes the empty set (it matches no strings at all). For details see [3].

    These are the formal definitions.
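As a rough illustration of the formal operators, here is a minimal sketch using Python's re module. This is an assumption for demonstration only: Python uses Perl-style syntax rather than the formal notation, so the union (abc + def) is written abc|def, and fullmatch is used so that the whole string has to belong to the language. The helper name accepts is made up for this example.

```python
import re

# Formal (abc)* in Python syntax: zero or more repetitions of the group "abc".
star = re.compile(r"(abc)*")
# Formal (abc + def): union is spelled | in Python's syntax.
union = re.compile(r"abc|def")


def accepts(pattern, text):
    """Return True if the whole of text is matched by pattern (hypothetical helper)."""
    return pattern.fullmatch(text) is not None


star_results = {s: accepts(star, s) for s in ["", "abc", "abcabc", "abcaabc"]}
union_results = {s: accepts(union, s) for s in ["abc", "def", "abcdef"]}

print(star_results)   # only "abcaabc" is rejected
print(union_results)  # "abc" and "def" are accepted, "abcdef" is not
```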

When you use applications that accept POSIX regular expression syntax, the operators have the following meanings:

  • These are the POSIX Basic regular expression operators

    . : The dot '.' matches any single character. a.c matches abc, axc, amc, aoc, and so on.

    ^ : Indicates the start of a line. ^abc matches abc only when it appears at the start of the line; an abc in the middle of the line is not matched.

    $ : Indicates the end of a line. abc$ matches abc only at the end of the line; an 'abc' elsewhere in the line is not matched.

    * : Matches the symbol preceding the '*' 0 or more times. So ab*c matches ac, abc, abbc, abbbc, abbbbbc, abbbbbbbbc, etc.

    {m,n} : Matches the preceding symbol at least 'm' times and at most 'n' times. ab{2,4}c does not match 'abc', but matches 'abbc', 'abbbc', and 'abbbbc'; it does not match 'abbbbbc'. So it matches when the number of 'b's is >= 2 and <= 4. (In strict POSIX basic syntax the braces are written with backslashes, \{m,n\}; the unescaped form shown here is the extended syntax.)

    {m,} : Matches the preceding symbol at least 'm' times, with no upper limit (note the comma).

    {n} : Matches the preceding symbol exactly 'n' times, so ab{3}c matches only 'abbbc'.

    [symbols] : Matches any one of the symbols inside the square brackets. a[xyz]c matches 'axc', 'ayc', and 'azc' and no other strings.

    [^symbols] : Matches any one symbol that is not inside the square brackets. a[^xyz]c matches any string 'a.c' with the '.' being any symbol except x, y, z.

  • These are the POSIX Extended regular expression operators (needs grep -E)

    ? : Matches the preceding symbol 0 or 1 time, so ab?c matches 'ac' and 'abc' only.

    + : Matches the preceding symbol at least 1 time, with no upper bound. ab+c matches abc, abbbc, abbbbbc, abbbbbbbbc, etc., but does not match 'ac'.

    | : Matches either the expression on the left side of the '|' or the expression on the right side, like (ab+c)|(xy*z).

  • Also have a look at the POSIX character classes: [:alpha:] represents all alphabetic characters, [:punct:] denotes all punctuation characters, etc. They are used inside a bracket expression, as in [[:alpha:]].

  • Wildcards / globs: if you are using * and ? as wildcards (for example in shell file name patterns), they are interpreted as follows:

    * : Matches any number of any characters at this position. *.c means all strings ending with '.c' (here '.' has no special meaning). Test with ls *.c or ls *.doc

    ? : Matches exactly one character, whatever it is, at this position. file??.txt matches 'fileab.txt', 'file00.txt', etc.: 'file', then exactly two characters of any kind, then '.txt'. Test with ls *.??? which lists all the files with a three-character extension.
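The examples in the lists above can be checked with a short script. This is only a sketch under some assumptions: Python's re module is Perl-style rather than POSIX, although the operators listed here behave the same way in it, and fnmatch.fnmatchcase only approximates shell globbing (for instance, it does not treat leading dots or '/' specially). The file names are invented for the test.

```python
import re
from fnmatch import fnmatchcase

# (pattern, string, expected result) with the whole string required to match.
regex_cases = [
    (r"a.c", "axc", True),            # '.' is any single character
    (r"ab*c", "ac", True),            # zero b's is fine for '*'
    (r"ab*c", "abbbc", True),
    (r"ab{2,4}c", "abc", False),      # needs at least two b's
    (r"ab{2,4}c", "abbbbc", True),
    (r"ab{2,4}c", "abbbbbc", False),  # five b's is too many
    (r"a[xyz]c", "ayc", True),
    (r"a[^xyz]c", "ayc", False),      # y is excluded by the negated set
    (r"a[^xyz]c", "abc", True),
    (r"ab?c", "abbc", False),         # '?' allows at most one b
    (r"ab+c", "ac", False),           # '+' needs at least one b
]

# Glob patterns: '*' and '?' stand for characters themselves.
glob_cases = [
    ("*.c", "main.c", True),
    ("*.c", "mainxc", False),         # '.' is a literal dot in a glob
    ("file??.txt", "file00.txt", True),
    ("file??.txt", "file0.txt", False),
]

regex_ok = all((re.fullmatch(p, s) is not None) == want for p, s, want in regex_cases)
glob_ok = all(fnmatchcase(s, p) == want for p, s, want in glob_cases)

# The same pattern text means different things in the two notations.
star_as_regex = re.fullmatch(r"ab*", "abxyz") is not None  # 'a' then only b's
star_as_glob = fnmatchcase("abxyz", "ab*")                 # 'ab' then anything

print(regex_ok, glob_ok, star_as_regex, star_as_glob)
```

The last two lines show the source of the confusion in the question: as a glob, * stands for "any characters" on its own, while in a regular expression it only repeats whatever precedes it.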

I hope this answers your question. You might also want to read through some material on the formal definitions, POSIX, and perhaps Perl-style regular expressions for a clearer picture.

References:

[1] Wikipedia page

[2] grep manual, Regular expression section

[3] Theory of Computation by Michael Sipser

Note: This answer was reconstructed


? means zero or one of the preceding element. * means zero or more of it. So this:

^ab?$

would match a and ab, but not abb. This:

^ab*$

would match not only a and ab, but also abb, abbb, and a with any number of bs following it.
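The two patterns can be compared side by side. A minimal sketch, assuming Python's re module (the candidate strings are just the ones discussed above):

```python
import re

candidates = ["a", "ab", "abb", "abbb"]

# ^ab?$ : an 'a' followed by at most one 'b', and nothing else.
optional_b = [s for s in candidates if re.search(r"^ab?$", s)]

# ^ab*$ : an 'a' followed by any number of 'b's (including none).
any_bs = [s for s in candidates if re.search(r"^ab*$", s)]

print(optional_b)  # ['a', 'ab']
print(any_bs)      # ['a', 'ab', 'abb', 'abbb']
```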


For the sake of completeness, *? is also used in Perl-style regular expressions. It is a lazy (non-greedy) match: it matches as few characters as possible before matching the next token.

For example:

a.*a

would match the whole of abaaba

while

a.*?a

would match aba twice: once for the first three characters and once for the last three.
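The greedy and lazy behaviour can be seen directly with re.findall, which returns every non-overlapping match. This is a small sketch assuming Python's re module, where *? is supported:

```python
import re

text = "abaaba"

# Greedy: .* grabs as much as it can, so one match spans the whole string.
greedy = re.findall(r"a.*a", text)

# Lazy: .*? stops at the first 'a' it can, so there are two short matches.
lazy = re.findall(r"a.*?a", text)

print(greedy)  # ['abaaba']
print(lazy)    # ['aba', 'aba']
```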
