Finding All Characters Between Parentheses with a .NET Regex
I need to get all characters between '(' and ')' chars.
var str = "dfgdgdfg (aaa.bbb) sfd (c) fdsdfg ( ,ddd (eee) )";
In this example, I need to get 3 strings:
(aaa.bbb)
(c)
( ,ddd (eee) )
What pattern do I have to write? Please help.
Try something like this:
\(([^)]+)\)
Edit: Actually, this doesn't quite work for the last bit: the expression doesn't capture the last substring properly, because it stops at the first ')' it finds. I have CW'd this answer so that someone with more time can flesh it out to make it work properly.
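To see what goes wrong with the simple pattern, here is a minimal Python sketch. Python's re module is used only for illustration, the variable names are my own, and I'm assuming this particular pattern behaves the same way in .NET:

```python
import re

s = "dfgdgdfg (aaa.bbb) sfd (c) fdsdfg ( ,ddd (eee) )"

# [^)]+ stops at the first ')', so the nested group is cut short.
simple_matches = [m.group(0) for m in re.finditer(r"\([^)]+\)", s)]
print(simple_matches)
# ['(aaa.bbb)', '(c)', '( ,ddd (eee)']
```

The last match ends at the ')' that closes the inner (eee), so the outer group is left unbalanced.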
.NET can match nested constructs in regular expressions using balancing groups (it has no recursion syntax as such). See, for example, http://blog.stevenlevithan.com/archives/balancing-groups
Mastering Regular Expressions comes highly recommended
You want to use the balancing group feature of .NET regular expressions.
var s = "dfgdgdfg (aaa.bbb) sfd (c) fdsdfg ( ,ddd (eee) )";
var exp = @"\([^()]*((?<paren>\()[^()]*|(?<close-paren>\))[^()]*)*(?(paren)(?!))\)"; // verbatim string, so the backslashes reach the regex engine
var matches = Regex.Matches(s,exp);
You either need a lexer/parser combo, or a lexer with stack support. But regex on its own will get you nowhere.
You need recursion to do this.
A Perl example:
#!/usr/bin/perl
$re = qr /
    (               # start capture buffer 1
        \(          # match an opening paren
        (           # capture buffer 2
            (?:     # match one of:
                (?> # don't backtrack over the inside of this group
                    [^()]+ # one or more
                )   # end non backtracking group
            |       # ... or ...
                (?1) # recurse to opening 1 and try it again
            )*      # 0 or more times.
        )           # end of buffer 2
        \)          # match a closing paren
    )               # end capture buffer one
/x;
sub strip {
    my ($str) = @_;
    while ($str=~/$re/g) {
        $match=$1; $striped=$2;
        print "$match\n";
        strip($striped) if $striped=~/\(/;
        return $striped;
    }
}
$str="dfgdgdfg (aaa.bbb) sfd (c) fdsdfg ( ,ddd (eee) )";
print "\n\nstart=$str\n";
while ($str=~/$re/g) {
    strip($1) ;
}
Output:
start=dfgdgdfg (aaa.bbb) sfd (c) fdsdfg ( ,ddd (eee) )
(aaa.bbb)
(c)
( ,ddd (eee) )
(eee)
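For comparison, the same outer-first listing can be produced without recursive regex support. This is a rough Python sketch, not a translation of the Perl code: the helper names top_level_groups and all_groups are made up for illustration, and it assumes stray ')' characters should simply be ignored.

```python
def top_level_groups(text):
    """Yield (group, inner) for each balanced top-level '(...)' in text."""
    depth = 0
    start = -1
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i          # remember where the outermost group opens
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:         # outermost group just closed
                yield text[start:i + 1], text[start + 1:i]


def all_groups(text):
    """Return each outer group first, then the groups nested inside it."""
    found = []
    for group, inner in top_level_groups(text):
        found.append(group)
        found.extend(all_groups(inner))   # recurse into the group's contents
    return found


groups = all_groups("dfgdgdfg (aaa.bbb) sfd (c) fdsdfg ( ,ddd (eee) )")
for g in groups:
    print(g)
```

For the sample string this prints the same four lines as the Perl output above, in the same order.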
As already mentioned by others: regex is not well suited for such a task. However, if your parentheses do not exceed a fixed nesting depth, you could do it, but if the nesting can be 3 or more levels deep, the regex will become a pain to write (and maintain!). Have a look at the regex that matches parentheses with at most one nested pair inside:
\((?:[^()]|\([^)]*\))*\)
which means:
\(          # match the character '('
(?:         # start non-capture group 1
  [^()]     # match any character not from the set {'(', ')'}
  |         # OR
  \(        # match the character '('
  [^)]*     # match any character not from the set {')'} and repeat it zero or more times
  \)        # match the character ')'
)*          # end non-capture group 1 and repeat it zero or more times
\)          # match the character ')'
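A quick way to check the depth limit of this pattern, sketched in Python (the second test string is my own example, not from the question):

```python
import re

one_level = re.compile(r"\((?:[^()]|\([^)]*\))*\)")

s = "dfgdgdfg (aaa.bbb) sfd (c) fdsdfg ( ,ddd (eee) )"
depth_one = one_level.findall(s)
print(depth_one)
# ['(aaa.bbb)', '(c)', '( ,ddd (eee) )']

# Two levels of nesting: the match ends too early and is unbalanced.
depth_two = one_level.findall("(a (b (c) d) e)")
print(depth_two)
# ['(a (b (c) d)']
```

So the pattern handles the sample string, but with one more level of nesting it silently returns an unbalanced match rather than failing.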
The version for 3 levels will make your eyes bleed! You could go with .NET's feature for matching nested constructs (balancing groups), but I personally wouldn't: sprinkling recursion inside regex leads to madness! (Not really, of course, but regexes are hard enough to comprehend, and adding recursion to the mix doesn't make them any clearer, IMO.)
I'd just write a small method that might look like this Python snippet:
def find_parens(text):
    matches = []
    parens = 0          # current nesting depth
    start_index = -1    # index of the '(' that opened the current top-level group
    index = 0
    for char in text:
        if char == '(':
            parens += 1
            if start_index == -1:
                start_index = index
        if char == ')':
            parens -= 1
            if parens == 0 and start_index > -1:
                matches.append(text[start_index:index+1])
                start_index = -1
        index += 1
    return matches

for m in find_parens("dfgdgdfg (aaa.bbb) sfd (c) fdsdfg ( ,ddd (eee) )"):
    print(m)
which prints:
(aaa.bbb)
(c)
( ,ddd (eee) )
I'm not familiar with C#, but the Python code above reads just like pseudocode and shouldn't take much effort to convert into C#, I presume.
Not saying this is better than regex, but here's another option:
public static IEnumerable<string> InParen(string s)
{
    int count = 0;
    StringBuilder sb = new StringBuilder();
    foreach (char c in s)
    {
        switch (c)
        {
            case '(':
                count++;
                sb.Append(c);
                break;
            case ')':
                count--;
                sb.Append(c);
                if (count == 0)
                {
                    yield return sb.ToString();
                    sb = new StringBuilder();
                }
                break;
            default:
                if (count > 0)
                    sb.Append(c);
                break;
        }
    }
}
If you only need to handle a single level of nesting you can use a pair of mutually exclusive patterns.
(\([^()]*\))
(\([^()]*\([^()]*\)[^()]*\))
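One thing to watch for when trying these two patterns: run separately, the first one also matches the inner (eee). A small Python sketch of how they might be combined, assuming the nested pattern is tried first in an alternation (full_matches is a hypothetical helper, and I've dropped the outer capture groups for brevity):

```python
import re

s = "dfgdgdfg (aaa.bbb) sfd (c) fdsdfg ( ,ddd (eee) )"

flat = r"\([^()]*\)"                      # no nested parentheses
nested = r"\([^()]*\([^()]*\)[^()]*\)"    # exactly one nested pair


def full_matches(pattern, text):
    """Return the full text of every non-overlapping match."""
    return [m.group(0) for m in re.finditer(pattern, text)]


flat_only = full_matches(flat, s)
nested_only = full_matches(nested, s)
# Try the nested form first so the outer group wins over its inner one.
combined = full_matches(nested + "|" + flat, s)

print(flat_only)    # ['(aaa.bbb)', '(c)', '(eee)']
print(nested_only)  # ['( ,ddd (eee) )']
print(combined)     # ['(aaa.bbb)', '(c)', '( ,ddd (eee) )']
```

Note that the nested pattern only allows a single inner pair, so something like ( (a) (b) ) would need yet another variant.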
Or you can skip regular expressions and just parse the string directly. Increment a state variable on '(', decrement it on ')', and print a line when it returns to zero.