开发者

Remove all nested blocks, whilst leaving non-nested blocks alone via python

Source:

[This] is some text with [some [blocks that are nested [in a [variety] of ways]]]

Resultant text:

[This] is some text with

I don't think you can do a regex for this, from looking at the threads at stack overflow.

Is there a simple way to to do this -> or must one reach for pypa开发者_JS百科rsing (or other parsing library)?


Here's an easy way that doesn't require any dependencies: scan the text and keep a counter for the braces that you pass over. Increment the counter each time you see a "["; decrement it each time you see a "]".

  • As long as the counter is at zero or one, put the text you see onto the output string.
  • Otherwise, you are in a nested block, so don't put the text onto the output string.
  • If the counter doesn't finish at zero, the string is malformed; you have an unequal number of opening and closing braces. (If it's greater than zero, you have that many excess [s; if it's less than zero you have that many excess ]s.)


Taking the OP's example as normative (any block including further nested blocks must be removed), what about...:

import itertools

x = '''[This] is some text with [some [blocks that are nested [in a [variety]
of ways]]] and some [which are not], and [any [with nesting] must go] away.'''

def nonest(txt):
  pieces = []
  d = 0
  level = []
  for c in txt:
    if c == '[': d += 1
    level.append(d)
    if c == ']': d -= 1
  for k, g in itertools.groupby(zip(txt, level), lambda x: x[1]>0):
    block = list(g)
    if max(d for c, d in block) > 1: continue
    pieces.append(''.join(c for c, d in block))
  print ''.join(pieces)

nonest(x)

This emits

[This] is some text with  and some [which are not], and  away.

which under the normatime hypothesis would seem to be the desired result.

The idea is to compute, in level, a parallel list of counts "how nested are we at this point" (i.e., how many opened and not yet closed brackets have we met so far); then segment the zip of level with the text, with groupby, into alternate blocks with zero nesting and nesting > 0. For each block, the maximum nesting herein is then computed (will stay at zero for blocks with zero nesting - more generally, it's just the maximum of the nesting levels throughout the block), and if the resulting nesting is <= 1, the corresponding block of text is preserved. Note that we need to make the group g into a list block as we want to perform two iteration passes (one to get the max nesting, one to rejoin the characters into a block of text) -- to do it in a single pass we'd need to keep some auxiliary state in the nested loop, which is a bit less convenient in this case.


You will be better off writing a parser, especially if you use a parser generator like pyparsing. It will be more maintainable and extendable.

In fact pyparsing already implements the parser for you, you just need to write the function that filters the parser output.


I took a couple of passes at writing a single parser expression that could be used with expression.transformString(), but I had difficulty distinguish between nested and unnested []'s at parse time. In the end I had to open up the loop in transformString and iterate over the scanString generator explicitly.

To address the question of whether [some] should be included or not based on the original question, I explored this by adding more "unnested" text at the end, using this string:

src = """[This] is some text with [some [blocks that are 
    nested [in a [variety] of ways]] in various places]"""

My first parser follows the original question's lead, and rejects any bracketed expression that has any nesting. My second pass takes the top level tokens of any bracketed expression, and returns them in brackets - I didn't like this solution so well, as we lose the information that "some" and "in various places" are not contiguous. So I took one last pass, and had to make a slight change to the default behavior of nestedExpr. See the code below:

from pyparsing import nestedExpr, ParseResults, CharsNotIn

# 1. scan the source string for nested [] exprs, and take only those that
# do not themselves contain [] exprs
out = []
last = 0
for tokens,start,end in nestedExpr("[","]").scanString(src):
    out.append(src[last:start])
    if not any(isinstance(tok,ParseResults) for tok in tokens[0]):
        out.append(src[start:end])
    last = end
out.append(src[last:])
print "".join(out)


# 2. scan the source string for nested [] exprs, and take only the toplevel 
# tokens from each
out = []
last = 0
for t,s,e in nestedExpr("[","]").scanString(src):
    out.append(src[last:s])
    topLevel = [tok for tok in t[0] if not isinstance(tok,ParseResults)]
    out.append('['+" ".join(topLevel)+']')
    last = e
out.append(src[last:])
print "".join(out)


# 3. scan the source string for nested [] exprs, and take only the toplevel 
# tokens from each, keeping each group separate
out = []
last = 0
for t,s,e in nestedExpr("[","]", CharsNotIn('[]')).scanString(src):
    out.append(src[last:s])
    for tok in t[0]:
        if isinstance(tok,ParseResults): continue
        out.append('['+tok.strip()+']')
    last = e
out.append(src[last:])
print "".join(out)

Giving:

[This] is some text with 
[This] is some text with [some in various places]
[This] is some text with [some][in various places]

I hope one of these comes close to the OP's question. But if nothing else, I got to explore nestedExpr's behavior a little further.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜