开发者

Removing text within parentheses (parentheses within parentheses prob)

I am trying to remove te开发者_开发知识库xt that is within parentheses (along with the parentheses themselves) but am having trouble with the scenario where there are parentheses within parentheses. This is the method I am using (in Ruby):

sentence.gsub(/\(.*?\)/, "") 

and that works fine until I have a sentence such as:

"This is (a test (string))"

Then the above chokes. Anyone have any idea how to do this? I am completely stumped.


One approch is to replace the parenthetical groups from the inside out:

x = string.dup
while x.gsub!(/\([^()]*\)/,""); end
x


Looks like you need to be greedy, by removing the ?

>> "This is (a test (string))".gsub(/\(.*\)/, "")
=> "This is "

That makes it go to the last ) instead of the first. It doesn't capture nesting, though, because a regex can't do that.


The problem with this is that languages containing nested parentheses (or indeed anything nested, IOW anything that requires recursion) are not regular, they are at least context-free. This means that they cannot be described by a regular grammar. Regular expressions are a compact notation for regular grammars. Ergo, nested parentheses cannot be described by regular expressions.

However, we aren't talking about regular expressions here, we are talking about Regexps. While their semantics and syntax are (very) loosely based on regular expressions, they are quite different and especially much more powerful. Depending on the particular flavor of Regexp you use, they may or may not be able to express recursion and thus parse nested parentheses. Perl Regex, for example can parse nested parentheses. I'm not sure whether Ruby's Regexp can, but I really don't care, because the way that Regexp are more powerful than regular expressions is generally achieved by bolting more and more syntax onto them.

This turns regular expressions, which are designed to be simple, in incomprehensible monsters. (If you can tell at a glance what the Perl Regex posted by @Anon does, then go for it. But I can't and thus I prefer not to use it.)

I prefer using a more powerful parser, rather than a complex Regexp.

In this case, you have a context-free language, therefore you can use a very simple recursive descent parser. You can further simplify your recursive descent parser by handling those sub-parts which are regular with a regular expression. Finally, if you replace the recursion in the recursive descent parser with iteration + mutation and make clever use of Ruby's boolean semantics, the entire parser gets basically condensed down to this single line:

while str.gsub!(/\([^()]*?\)/, ''); end

Which I don't think is too bad.

Here's the entire thing with some extra removal of duplicate whitespace and (of course) a test suite:

require 'test/unit'
class TestParenthesesRemoval < Test::Unit::TestCase
  def test_that_it_removes_even_deeply_nested_parentheses
    str = 'This is (was?) some ((heavily) parenthesized (but not overly so 
          (I hope))) text with (superflous) parentheses: )(.'
    res = 'This is some text with parentheses: )(.'

    while str.gsub!(/\([^()]*?\)/, ''); end
    str.squeeze!(' ')

    assert_equal res, str
  end
end


The following Perl regex will match balanced parentheses:

/(\((?:[^\(\)]++|(?1))*\))/

However, by the time you get to this point, you're not technically using "regular" expressions anymore.


jleedev's answer will work if there is just one set of parentheses at the outermost level; in that case, making the expression for the innards of those parentheses greedy should do the trick.

However, and perhaps a bit surprisingly, regexps as defined in Perl, Java, Ruby and a few other languages but also grep and sed are not suitable for dealing with this problem. There is no regexp for dealing with the general case of nested delimiters. This is one reason why people on SO yell at you when you want to use a regexp to process HTML or XML.

Interestingly, the creator of the Lua language addressed this problem by adding a new matching pattern to the otherwise rather simple pattern language. Look at the bottom handful of lines in http://www.lua.org/pil/20.2.html !

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜