Regexp, skip nested pairs
In my own markup language, I have quotation tags: >>which use these characters to make a blockquote<<. The problem starts when there is a nested blockquote:
>>(1)
start1
>>(2)quote 2!<<(3)
<<(4)
I would like to match only the outermost tags, like this:
<blockquote>
start1
>>quote 2!<<
</blockquote>
If I try a simple ungreedy regex, />>(.+?)<</, (1) and (3) will be matched, and (2) and (4) will never be matched. If I make it greedy, />>(.+)<</, (1) and (4) will successfully match (and by recursively calling the function I can then match (2) with (3)), but it won't work when I have two blocks in the same piece of text:
>>(A)quote1<<(B)
>>(C)quote2<<(D)
The greedy one will match (A) with (D), leaving (B) and (C) alone. I suppose I'd have to somehow make it "ungreedy, but only if there are no other pairs inside", which goes beyond my skills. Is there a way to make it work correctly, so that (1) matches (4), (A) matches (B), and (C) matches (D)? If you can think of a non-regexp solution (but not a parser), that would be good enough for me too. I am not asking how to also make (2) match (3), just how to skip them (or any other nested pairs) successfully.
Success! Inspired by Arjen's suggestion, in the end I used a construction like this (not necessarily working):
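// Encode every lone '>' (one that is not part of a '>>' pair).
// Assumption: the replacement was meant to be the HTML entity '&gt;'.
$text = preg_replace('/(?<!>)>(?!>)/', '&gt;', $text);
// Replace the innermost blocks with placeholders until no block is left.
do {
    $text = preg_replace_callback('/>>([^>]+?)<</', 'blockHashFunction', $text, -1, $count);
} while ($count > 0);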
That is, I first encode all single >'s and then run preg_replace_callback repeatedly. Hashing in this case means that >>asdsad<< is replaced by, for example, "\xFE:3:\xFE", which at the end of the script is unhashed (well, more like decoded, I guess) into the proper <blockquote>asdsad</blockquote>.
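For anyone who wants to try the placeholder idea end to end, here is a minimal sketch of how the pieces might fit together. The names convertOutermostQuotes and restoreBlocks are hypothetical (they are not from my real script), and I am assuming that only placeholders left at the top level should become <blockquote> while nested ones are written back as literal >>...<<.

```php
<?php
// Hypothetical helper: expands "\xFE:N:\xFE" placeholders back into text.
// Top-level placeholders become <blockquote>; nested ones stay as >>...<<.
function restoreBlocks(string $text, array $blocks, int $depth = 0): string
{
    return preg_replace_callback(
        '/\xFE:(\d+):\xFE/',
        function (array $m) use ($blocks, $depth): string {
            // Resolve placeholders nested inside this block first.
            $body = restoreBlocks($blocks[(int) $m[1]], $blocks, $depth + 1);
            return $depth === 0
                ? "<blockquote>$body</blockquote>"
                : ">>$body<<";
        },
        $text
    );
}

// Hypothetical wrapper around the encode / hash / decode steps.
function convertOutermostQuotes(string $text): string
{
    $blocks = [];
    // Store each block body and hand back a placeholder without '>' or '<'.
    $store = function (array $m) use (&$blocks): string {
        $blocks[] = $m[1];
        return "\xFE:" . (count($blocks) - 1) . ":\xFE";
    };

    // Encode lone '>' characters (assumed encoding: the entity '&gt;').
    $text = preg_replace('/(?<!>)>(?!>)/', '&gt;', $text);

    // Innermost blocks first; repeat until nothing is replaced.
    do {
        $text = preg_replace_callback('/>>([^>]+?)<</', $store, $text, -1, $count);
    } while ($count > 0);

    return restoreBlocks($text, $blocks);
}

echo convertOutermostQuotes(">>start1 >>quote 2!<< end<<"), "\n";
```

Counting replacements with the $count argument, rather than comparing string lengths, avoids stopping early if a placeholder happens to be as long as the text it replaces.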
Regular expressions are not really suited for this kind of parsing. There actually are some RegEx engines that support nested/balanced matching, like the .NET Framework RegEx engine (see: http://blogs.msdn.com/b/bclteam/archive/2005/03/15/396452.aspx). However, I feel this leads to very complex patterns.
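Since the question uses PHP, it may be worth noting that PCRE (the engine behind the preg_* functions) also supports recursive patterns through (?R). The following is only a sketch of what such a pattern could look like for the >> and << tags; the function name quoteOutermost is made up for illustration, and unbalanced input is not handled specially.

```php
<?php
// Hypothetical helper: wraps only the outermost >>...<< pairs in <blockquote>.
function quoteOutermost(string $text): string
{
    // >>                 opening tag
    // (?: (?!>>|<<).     any single character that does not start a tag
    //   | (?R) )*        or a complete nested >>...<< pair (recursion)
    // <<                 closing tag
    // The 's' modifier lets '.' match newlines, so blocks may span lines.
    $pattern = '/>>((?:(?!>>|<<).|(?R))*)<</s';

    // Matches do not overlap, so nested pairs are consumed as part of the
    // outer match and are left untouched inside $1.
    return preg_replace($pattern, '<blockquote>$1</blockquote>', $text);
}

echo quoteOutermost(">>\nstart1\n>>quote 2!<<\n<<"), "\n";
```

Even for two-character tags the pattern is already hard to read, which illustrates the complexity concern.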
You are much better off creating a regular expression that matches a begin or end tag and manually building a tree of all matches. After processing the entire string, you can discard the unwanted matches from the resulting collection.
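A simplified sketch of that idea in PHP follows. Instead of building a full tree it only tracks the nesting depth, which is enough when the goal is to convert the outermost pairs and skip everything nested. The function name replaceOutermostTags is hypothetical, and the handling of a stray << (left as plain text) is my assumption.

```php
<?php
// Hypothetical helper: scans for begin/end tags and converts only depth-0 pairs.
function replaceOutermostTags(string $text): string
{
    // Split on the tags but keep them in the token list.
    $tokens = preg_split('/(>>|<<)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $depth = 0;
    $out = '';
    foreach ($tokens as $token) {
        if ($token === '>>') {
            // Only an opener at depth 0 starts a blockquote.
            $out .= ($depth === 0) ? '<blockquote>' : '>>';
            $depth++;
        } elseif ($token === '<<' && $depth > 0) {
            $depth--;
            // Only the closer that returns to depth 0 ends the blockquote.
            $out .= ($depth === 0) ? '</blockquote>' : '<<';
        } else {
            // Plain text, or a stray '<<' that has no matching opener.
            $out .= $token;
        }
    }
    return $out;
}

echo replaceOutermostTags(">>start1 >>quote 2!<< end<<"), "\n";
```

This sketch does not repair unbalanced input: an opener that is never closed still emits <blockquote> without a closing tag, so a real implementation would want to check that $depth is 0 at the end.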