Mutually exclusive regular expressions
If I have a list of regular expressions, is there an easy way to determine that no two of them will both return a match for the same string?
That is, the list is valid if and only if for all strings a maximum of one item in the list will match the entire string.
开发者_StackOverflow中文版It seems like this will be very hard (maybe impossible?) to prove definitively, but I can't seem to find any work on the subject.
The reason I ask is that I am working on a tokenizer that accepts regexes, and I would like to ensure only one token at a time can match the head of the input.
If you're working with pure regular expressions (no backreferences or other features that cause them to recognize context-free or more complicated languages), what you ask is possible. What you can do is convert each regex to a DFA, then (since regular languages are closed under intersection) combine them into a DFA that recognizes the intersection of the two languages. If that DFA has a path from the start state to an accepting state, that string is accepted by both input regexen.
The problem with this is that the first step of the usual regex->DFA algorithm is to convert the regex to a NFA, then convert the NFA to a DFA. But that last step can result in an exponential blowup in the number of DFA states, so this will only be feasible for very simple regular expressions.
If you are working with extended regex syntax, all bets are off: context-free languages are not closed under intersection, so this method won't work.
The Wkipedia article on regular expressions does state
It is possible to write an algorithm which for two given regular expressions decides whether the described languages are essentially equal, reduces each expression to a minimal deterministic finite state machine, and determines whether they are isomorphic (equivalent).
but gives no further hints.
Of course the easy way you are after is to run a lot of tests -- but we all know the shortcomings of testing as a method of proof.
You can't do that by only looking at the regular expression.
Consider the case where you have [0-9]
and [0-9]+
. They are obviously different expressions, but when applied to the string "1", they both produce the same result. When applied to string "11" they produce different results.
The point is that a regular expression isn't enough information. The result depends both on the regex and the target string.
精彩评论