
Regexp, skip nested pairs

In my own markup language, I have quotation tags >>which use these characters to make a blockquote<<. The problem starts when there is a nested blockquote:

>>(1)
start1
  >>(2)quote 2!<<(3)
<<(4)

I would like to match only the outermost tags, like this:

<blockquote>
start1
  >>quote 2!<<
</blockquote>

If I try a simple ungreedy regex, />>(.+?)<</, (1) and (3) will be matched, and (2) and (4) won't ever be matched. If I make it greedy, />>(.+)<</, (1) and (4) will successfully match (and by recursively calling the function I can then match (2) with (3)), but it won't work when I have two blocks in the same piece of text:

>>(A)quote1<<(B)

>>(C)quote2<<(D)

The greedy one will match (A) with (D), leaving (B) and (C) alone. I suppose I'd have to somehow make it "ungreedy, but only if there are no other pairs inside", which goes beyond my skills. Is there a way to make it work correctly, so that (1) matches (4), (A) matches (B) and (C) matches (D)? If you can think of a non-regexp solution (but not a parser), that would be good enough for me too. I am not asking how to also make (2) match (3), just how to skip them (or any other nested pairs) successfully.

Success! Inspired by Arjen's suggestion, in the end I used this construction (not necessarily working):

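// Encode every lone '>' (one that is not part of a '>>' tag).
$text = preg_replace('/(?<!>)>(?!>)/', '&gt;', $text);
// Loop on the replacement count: a strlen() check can stop early when a
// placeholder happens to be as long as the block it replaces.
do {
    // Innermost pairs contain no '>', so they are hashed first.
    $text = preg_replace_callback('/>>([^>]+?)<</', "blockHashFunction", $text, -1, $count);
} while ($count > 0);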

I.e. I first encode all single >'s and then run preg_replace_callback repeatedly until nothing changes. Hashing in this case means that >>asdsad<< is replaced by, for example, "\xFE:3:\xFE", which at the end of the script is unhashed (well, more like decoded, I guess) into the proper <blockquote>asdsad</blockquote>.
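
For anyone trying the same thing, here is a rough sketch of what the hashing and unhashing could look like. Only the callback name blockHashFunction and the "\xFE:n:\xFE" placeholder format come from the post above; the $blocks store, the function bodies and unhashBlocks() are my assumptions, not the code actually used.

```php
<?php
// Hypothetical store for hashed blocks; the post does not show how it is kept.
$blocks = [];

// The callback name comes from the post; its body is an assumption.
function blockHashFunction(array $m): string
{
    global $blocks;
    $blocks[] = '<blockquote>' . $m[1] . '</blockquote>';
    // The placeholder holds no '>' or '<', so the enclosing pair can match next.
    return "\xFE:" . (count($blocks) - 1) . ":\xFE";
}

// Hypothetical decoder: expand placeholders until none are left,
// because a stored outer block may itself contain inner placeholders.
function unhashBlocks(string $text): string
{
    global $blocks;
    do {
        $text = preg_replace_callback(
            '/\xFE:(\d+):\xFE/',
            function (array $m) use ($blocks) {
                return $blocks[(int) $m[1]];
            },
            $text,
            -1,
            $count
        );
    } while ($count > 0);
    return $text;
}

$text = ">>outer >>inner<< tail<<";
do {
    $text = preg_replace_callback('/>>([^>]+?)<</', 'blockHashFunction', $text, -1, $count);
} while ($count > 0);

echo unhashBlocks($text), "\n";
// prints: <blockquote>outer <blockquote>inner</blockquote> tail</blockquote>
```

Note that this turns nested pairs into nested blockquotes rather than leaving the inner >>...<< as literal text.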


Regular expressions are not really suited to this kind of parsing. Some regex engines do support nested/balanced matching, such as the .NET Framework regex engine (see: http://blogs.msdn.com/b/bclteam/archive/2005/03/15/396452.aspx). However, I feel this leads to very complex patterns.
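
Since the question uses PHP: PCRE, the engine behind the preg_* functions, also supports recursive patterns through (?R). A minimal sketch, assuming the tags are always exactly >> and << and that every opener has a matching closer; the pattern is my own illustration, not something from the question:

```php
<?php
// (?R) re-enters the whole pattern, so a nested >>...<< pair is consumed as a unit.
// (?!>>|<<). matches any single character that does not start a tag.
$pattern = '/>>((?:(?!>>|<<).|(?R))*+)<</s';

$text = ">>\nstart1\n  >>quote 2!<<\n<<";
$html = preg_replace($pattern, '<blockquote>$1</blockquote>', $text);
echo $html, "\n";
```

Only the outermost pair is rewritten here; the inner >>quote 2!<< stays as literal text inside the blockquote. An opener without a matching closer simply does not match at that position.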

You are much better off creating a regular expression that matches a begin or end tag and manually building a tree of all matches. After processing the entire string, you can discard the unwanted matches from the resulting collection.
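
A simplified sketch of that idea in PHP, assuming you only need the outermost pairs: instead of building a full tree, it keeps a depth counter while walking the tag matches. The function name replaceOutermostQuotes is hypothetical.

```php
<?php
// Hypothetical helper: wrap only the outermost >>...<< pairs in <blockquote>.
function replaceOutermostQuotes(string $text): string
{
    // Find every opening or closing tag together with its byte offset.
    preg_match_all('/>>|<</', $text, $matches, PREG_OFFSET_CAPTURE);

    $out = '';
    $depth = 0;
    $pos = 0;   // start of the text not yet copied to $out
    $start = 0; // offset of the current outermost opener

    foreach ($matches[0] as [$tag, $offset]) {
        if ($tag === '>>') {
            if ($depth === 0) {
                $start = $offset;
            }
            $depth++;
        } elseif ($depth > 0) {
            $depth--;
            if ($depth === 0) {
                // Outermost pair closed: nested tags in between stay literal.
                $out .= substr($text, $pos, $start - $pos)
                    . '<blockquote>'
                    . substr($text, $start + 2, $offset - $start - 2)
                    . '</blockquote>';
                $pos = $offset + 2;
            }
        }
        // A stray '<<' at depth 0 is left as plain text.
    }

    // Whatever remains (including an unclosed opener) is copied unchanged.
    return $out . substr($text, $pos);
}

echo replaceOutermostQuotes(">>quote1<<\n\n>>quote2<<"), "\n";
```

With the nested example from the question this gives one blockquote around "start1" and the literal inner >>quote 2!<<, and the two-block example gives two separate blockquotes.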
