开发者

Can regexes containing ordered alternation be rewritten to use only unordered alternation?

Suppose I have a regex language supporting literals, positive and negative character classes, ordered alternation, the greedy quantifiers ?, *, and +, and the nongreedy quantifiers ??, *?, and +?. (This is essentially a subset of PCRE without backreferences, look-around assertions, or some of the other fancier bits.) Does replacing ordered alternation with unordered alternation decrease the expressive power of this formalism?

(Unordered alternation---also sometimes called "unordered choice"---is such that L(S|T) = L(S) + L(T), while ordered alternation is such that L(S|T) = L(S) + (L(T) - { a in L(T) : a extends some b in L(S) }). Concretely, the pattern a|aa would match the strings a and aa if the alternation is unordered, but only a if the alternation is ordered.)

Put another way, given a pattern S containing an ordered alternation, can that pattern be rewritten to an equivalent pattern T which contains no ordered alternations (but possibly unordered alternations instead)?

If this question has been considered in the literat开发者_如何学Goure, I'd appreciate any references which anyone can provide. I was able to turn up almost no theoretical work on the expressive power of extended regex formalisms (beyond the usual things about how backreferences move you from regular languages to context-free grammars).


in http://swtch.com/~rsc/regexp/regexp3.html [section "Does the regexp match a substring of the string? If so, where?"] it's necessary to introduce the idea of priorities within the "DFA" (you need to read the entire series to understand, i suspect, but the "DFA" in question is expanded from the NFA graph "on the fly") to handle ordered alternations. while this is only an appeal to authority, and not a proof, i think it's fair to say that if russ cox can't do it (express ordered alternations as a pure DFA), then no-one knows how to.


I haven't checked any literature but I think you can construct a DFA for the ordered alternation and thus prove that it doesn't add any expressive power in the following way:

  1. Let's say we have the regex x||y where x and y are regexen and || means the unordered alternation. If so we can construct DFA's accepting x and y. We will mark those DFA_x and DFA_y
  2. We will construct the DFA for x||y in phases by connecting DFA_x and DFA_y
  3. For every path in DFA_x corresponding to some string a (by path I mean a path in the graph sense without traversing and edge twice so a is a path in DFA_"a*" but aa is not)...
    • For every symbol in the alphabet s
      • If DFA_y consumes as (that is if run on as DFA_y will not stop early but it may not necessarily accept) and DFA_x does not and DFA_x doesn't accept any prefix of as create a transition from the state DFA_x ends in after consuming a to the state DFA_y ends in after consuming as
  4. The accepting states of the final DFA are all the accepting states of both the input DFA's. The starting state is the starting state of DFA_x.

Intuitively what this does is it creates two regions in the output DFA. One of them corresponds to the first argument of the alternation and the other to the second. As long as it's possible that the first argument of the alternation will match we stay in the first part. When a symbol is encountered which makes it certain that the first argument won't match we switch to the second part if possible at this point. Please comment if this approach is wrong.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜