开发者

Regex that matches xa?b?c? but not x alone

I'm trying to write a regex that matches xa?b?c? but not x. In reality开发者_运维知识库, 'x', 'a', 'b', and 'c' are not single characters, they are moderately complex sub-expressions, so I'm trying to avoid something like x(abc|ab|ac|bc|a|b|c). Is there a simple way to match "at least one of a, b, and c, in that order" in a regex, or am I out of luck?


Here’s the shortest version:

(a)?(b)?(c)?(?(1)|(?(2)|(?(3)|(*FAIL))))

If you need to keep around the match in a separate group, write this:

((a)?(b)?(c)?)(?(2)|(?(3)|(?(4)|(*FAIL))))

But that isn’t very robust in case a, b, or c contain capture groups. So instead write this:

(?<A>a)?(?<B>b)?(?<C>c)?(?(<A>)|(?(<B>)|(?(<C>)|(*FAIL))))

And if you need a group for the whole match, then write this:

(?<M>(?<A>a)?(?<B>b)?(?<C>c)?(?(<A>)|(?(<B>)|(?(<C>)|(*FAIL)))))

And if like me you prefer multi-lettered identifiers and also think this sort of thing is insane without being in /x mode, write this:

(?x)
(?<Whole_Match>
    (?<Group_A> a) ?
    (?<Group_B> b) ?  
    (?<Group_C> c) ?

    (?(<Group_A>)           # Succeed 
      | (?(<Group_B>)       # Succeed
          | (?(<Group_C>)   # Succeed
              |             (*FAIL)
            )
        )
    )
 )

And here is the full testing program to prove that those all work:

#!/usr/bin/perl
use 5.010_000;

my @pats = (
    qr/(a)?(b)?(c)?(?(1)|(?(2)|(?(3)|(*FAIL))))/,
    qr/((a)?(b)?(c)?)(?(2)|(?(3)|(?(4)|(*FAIL))))/,
    qr/(?<A>a)?(?<B>b)?(?<C>c)?(?(<A>)|(?(<B>)|(?(<C>)|(*FAIL))))/,
    qr/(?<M>(?<A>a)?(?<B>b)?(?<C>c)?(?(<A>)|(?(<B>)|(?(<C>)|(*FAIL)))))/,
    qr{
        (?<Whole_Match>

            (?<Group_A> a) ?
            (?<Group_B> b) ?
            (?<Group_C> c) ?

            (?(<Group_A>)               # Succeed
              | (?(<Group_B>)           # Succeed
                  | (?(<Group_C>)       # Succeed
                      |                 (*FAIL)
                    )
                )
            )

        )
    }x,
);

for my $pat (@pats) {
    say "\nTESTING $pat";
    $_ = "i can match bad crabcatchers from 34 bc and call a cab";
    while (/$pat/g) {
        say "$`<$&>$'";
    }
}

All five versions produce this output:

i <c>an match bad crabcatchers from 34 bc and call a cab
i c<a>n match bad crabcatchers from 34 bc and call a cab
i can m<a>tch bad crabcatchers from 34 bc and call a cab
i can mat<c>h bad crabcatchers from 34 bc and call a cab
i can match <b>ad crabcatchers from 34 bc and call a cab
i can match b<a>d crabcatchers from 34 bc and call a cab
i can match bad <c>rabcatchers from 34 bc and call a cab
i can match bad cr<abc>atchers from 34 bc and call a cab
i can match bad crabc<a>tchers from 34 bc and call a cab
i can match bad crabcat<c>hers from 34 bc and call a cab
i can match bad crabcatchers from 34 <bc> and call a cab
i can match bad crabcatchers from 34 bc <a>nd call a cab
i can match bad crabcatchers from 34 bc and <c>all a cab
i can match bad crabcatchers from 34 bc and c<a>ll a cab
i can match bad crabcatchers from 34 bc and call <a> cab
i can match bad crabcatchers from 34 bc and call a <c>ab
i can match bad crabcatchers from 34 bc and call a c<ab>

Sweet, eh?

EDIT: For the x in the beginning part, just put whatever x you want at the start of the match, before the very first optional capture group for the a part, so like this:

x(a)?(b)?(c)?(?(1)|(?(2)|(?(3)|(*FAIL))))

or like this

(?x)                        # enable non-insane mode

(?<Whole_Match>
    x                       # first match some leader string

    # now match a, b, and c, in that order, and each optional
    (?<Group_A> a ) ?
    (?<Group_B> b ) ?  
    (?<Group_C> c ) ?

    # now make sure we got at least one of a, b, or c
    (?(<Group_A>)           # SUCCEED!
      | (?(<Group_B>)       # SUCCEED!
          | (?(<Group_C>)   # SUCCEED!
              |             (*FAIL)
            )
        )
    )
)

The test sentence was constructed without the x part, so it won’t work for that, but I think I’ve shown how I mean to go at this. Note that all of x, a, b, and c can be arbitrarily complex patterns (yes, even recursive), not merely single letters, and it doesn’t matter if they use numbered capture groups of their own, even.

If you want to go at this with lookaheads, you can do this:

(?x)

(?(DEFINE)
       (?<Group_A> a)
       (?<Group_B> b)
       (?<Group_C> c)
)

x

(?= (?&Group_A)
  | (?&Group_B)
  | (?&Group_C)
)

(?&Group_A) ?
(?&Group_B) ?
(?&Group_C) ?

And here is what to add to the @pats array in the test program to show that this approach also works:

qr{
    (?(DEFINE)
        (?<Group_A> a)
        (?<Group_B> b)
        (?<Group_C> c)
    )

    (?= (?&Group_A)
      | (?&Group_B)
      | (?&Group_C)
    )

    (?&Group_A) ?
    (?&Group_B) ?
    (?&Group_C) ?
}x

You’ll notice please that I still manage never to repeat any of a, b, or c, even with the lookahead technique.

Do I win? ☺


Not trivially, if you don't have lookahead.

x(ab?c?|bc?|c)


How about this:

x(?:a())?(?:b())?(?:c())?(\1|\2|\3)

The empty capturing groups after a, b and c will always match (an empty string) if a, b or c match, in that order.

The (\1|\2|\3) part will only match if at least one of the preceding groups participated in the match. So if you just have x, the regex fails.

Every part of the regex will be evaluated just once.

Of course, if x, a, b and c are more complex subexpressions that contain capturing groups themselves, you have to adjust the numbers of the backreferences accordingly*.

Since this regex does look a bit strange, here's the verbose version:

x          # Match x
(?:a())?   # Try to match a. If this succeeds, \1 will contain an empty string.
(?:b())?   # Same with b and \2.
(?:c())?   # Same with c and \3.
(\1|\2|\3) # Now try to match the content of one of the backreferences. 
           # This works if one of the empty parentheses participated in the match.
           # If so, the backref contains an empty string which always matches. 
           # Bingo!

You might need to surround this with anchors (^ and $) unless you don't mind it matching xb within the string cxba etc.

For example, in Python:

>>> r=re.compile(r"x(?:a())?(?:b())?(?:c())?(\1|\2|\3)$")
>>> for test in ("x", "xa", "xabc", "xba"):
...     m = r.match(test)
...     if m:
...         print("{} --> {}".format(test, m.group(0)))
...     else:
...         print("{} --> no match".format(test))
...
x --> no match
xa --> xa
xabc --> xabc
xba --> no match

*or, if your regex flavor knows named capturing groups, you can use those, for example

x(?:a(?P<a>))?(?:b(?P<b>))?(?:c(?P<c>))?((?P=a)|(?P=b)|(?P=c))

in Python/PCRE. In .NET (and possibly other flavors), it's even legal to have several capturing groups that use the same name, making another simplification possible:

x(?:a(?<m>))?(?:b(?<m>))?(?:c(?<m>))?\k<m>


How about something like

x(?=[abc])a?b?c?


If you absolutely must not repeat a, b, or c, then this is the shortest, simplest regex--provided that x represents a fixed-length expression, or that the implementation you are using supports a variable-length one. It uses a negative look-behind, and Perl, for example, will die on a variable length look-behind.

Basically, it's what you are saying, rephrased:

/(x)a?b?c?(?<!x)/;

Here's what it says: I want to match xa?b?c? but when I consider it I don't want the last expression to have been x.

In addition, it will not work if the match for a, b, or c ends with x. (hat-tip: tchrist)


Here's the shortest I could come up with:

x(ab?c?|bc?|c)

I believe it matches the criteria while minimising repetition (although there is some). It also avoid using any look-aheads or other processor-intensive expressions, which is probably more valuable than saving regex string length.

This version repeats c three times. You could adapt it so that either a or b is the one repeated most often, so you could choose the shortest of a, b and c to be the one to be repeated three times.


If you don't need to find a maximal (greedy) match, you can drop the "in that order", because if you match x(a|b|c) and ignore any following text you have already matched "at least one of a, b, and c, in that order". In other words, if all you need is a true/false answer (does it match or not), then x(a|b|c) is sufficient. (Another assumption: that you are trying to determine whether the input string contains a match, not whether the whole string matches the regexp. I.e. see @Alan Moore's question.)

However if you want to identify a maximal match, or match against the entire input string, you can use lookahead: x(?=(a|b|c))a?b?c?

There is some redundancy there but a lot less than the combinatorial approach you were trying to avoid.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜