Remove all nested blocks, whilst leaving non-nested blocks alone via python

2022-12-14 17:08 Q&A author:

Source:

[This] is some text with [some [blocks that are nested [in a [variety] of ways]]]

Resultant text:

[This] is some text with

From looking at the threads on Stack Overflow, I don't think you can do this with a regex.

Is there a simple way to do this, or must one reach for pyparsing (or another parsing library)?

Here's an easy way that doesn't require any dependencies: scan the text and keep a counter for the brackets that you pass over. Increment the counter each time you see a "["; decrement it each time you see a "]".

As long as the counter is at zero or one, put the text you see onto the output string.
Otherwise, you are in a nested block, so don't put the text onto the output string.
If the counter doesn't finish at zero, the string is malformed; you have an unequal number of opening and closing brackets. (If it's greater than zero, you have that many excess [s; if it's less than zero, you have that many excess ]s.)
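The counter approach described above can be sketched as follows. This is a minimal illustration, not the answerer's own code: the function name strip_nested and the choice to raise ValueError on unbalanced input are assumptions. Note that this approach drops only the text at depth two or more, so the outer block's own text survives, which is not quite the same as the result shown in the question (where the whole outer block disappears).

```python
def strip_nested(text):
    """Keep characters at bracket depth 0 or 1; drop anything nested deeper."""
    out = []
    depth = 0  # number of "[" seen that have not been closed yet
    for ch in text:
        if ch == '[':
            depth += 1
            if depth <= 1:
                out.append(ch)
        elif ch == ']':
            # a "]" belongs to the depth it closes, so test before decrementing
            if depth <= 1:
                out.append(ch)
            depth -= 1
        elif depth <= 1:
            out.append(ch)
    if depth != 0:
        # only the final count is checked, as in the description above;
        # balanced-but-misordered input such as "][" is not detected
        raise ValueError("unbalanced brackets: final depth %d" % depth)
    return ''.join(out)


print(strip_nested("[This] is some text with [some [blocks] here]"))
# [This] is some text with [some  here]
```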

Taking the OP's example as normative (any block including further nested blocks must be removed), what about...:

import itertools

x = '''[This] is some text with [some [blocks that are nested [in a [variety]
of ways]]] and some [which are not], and [any [with nesting] must go] away.'''

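def nonest(txt):
  pieces = []
  d = 0
  level = []  # level[i] is the nesting depth at txt[i]
  for c in txt:
    if c == '[': d += 1
    level.append(d)
    if c == ']': d -= 1
  # split into runs outside brackets (depth 0) and runs inside them (depth > 0)
  for k, g in itertools.groupby(zip(txt, level), lambda x: x[1]>0):
    block = list(g)
    # skip any run that ever gets deeper than one level
    if max(d for c, d in block) > 1: continue
    pieces.append(''.join(c for c, d in block))
  print(''.join(pieces))

nonest(x)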

This emits

[This] is some text with  and some [which are not], and  away.

which under the normative hypothesis would seem to be the desired result.

The idea is to compute, in level, a parallel list of counts of how nested we are at each point (i.e., how many brackets have been opened and not yet closed so far), then use groupby to segment the zip of the text with level into alternating blocks with zero nesting and with nesting > 0. For each block we compute the maximum nesting level reached within it (this stays at zero for blocks with zero nesting), and if that maximum is <= 1, the corresponding block of text is preserved. Note that we need to turn the group g into a list, block, because we make two passes over it (one to get the maximum nesting, one to rejoin the characters into a block of text). Doing it in a single pass would require keeping some auxiliary state in the nested loop, which is a bit less convenient in this case.
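For comparison, the single-pass alternative mentioned above might look roughly like this. It is a sketch under assumptions: the name nonest_single_pass and the nested flag are illustrative, and the input is assumed to have balanced brackets. The auxiliary state is a buffer for the block currently being read plus a flag recording whether that block ever went deeper than one level.

```python
def nonest_single_pass(txt):
    pieces = []     # finished output fragments
    block = []      # characters of the top-level block currently being read
    depth = 0       # number of brackets opened and not yet closed
    nested = False  # has the current block ever reached depth > 1?
    for c in txt:
        if c == '[':
            depth += 1
            if depth > 1:
                nested = True
        if depth > 0:
            block.append(c)   # inside a block: decide later whether to keep it
        else:
            pieces.append(c)  # outside any block: always kept
        if c == ']':
            depth -= 1
            if depth == 0:
                # the top-level block just closed; keep it only if it was flat
                if not nested:
                    pieces.append(''.join(block))
                block = []
                nested = False
    return ''.join(pieces)


print(nonest_single_pass("[a] b [c [d] e] f [g]"))
# [a] b  f [g]
```

Because each top-level block is judged on its own, directly adjacent blocks such as [a][b [c]] are handled separately here, whereas grouping on "depth > 0" alone treats them as one run.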

You will be better off writing a parser, especially if you use a parser generator like pyparsing. It will be more maintainable and extendable.

In fact, pyparsing already implements the parser for you; you just need to write the function that filters the parser output.
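To make the "parse first, then filter the output" idea concrete without any dependency, here is a small hand-rolled sketch. It is not pyparsing and does not mirror its API; parse_brackets and render_flat are hypothetical names used only for illustration. The parser builds a tree in which plain text is a str and each bracketed group is a list, and the filter keeps only top-level groups that contain no further group.

```python
def parse_brackets(text):
    """Parse text into a list of str (plain text) and list (bracketed group)."""
    root = []
    stack = [root]  # stack[-1] is the group currently being filled
    buf = []        # pending plain-text characters

    def flush():
        if buf:
            stack[-1].append(''.join(buf))
            buf.clear()

    for ch in text:
        if ch == '[':
            flush()
            group = []
            stack[-1].append(group)
            stack.append(group)
        elif ch == ']':
            flush()
            if len(stack) == 1:
                raise ValueError("unmatched ']'")
            stack.pop()
        else:
            buf.append(ch)
    flush()
    if len(stack) != 1:
        raise ValueError("unmatched '['")
    return root


def render_flat(tree):
    """Rebuild the text, keeping only top-level groups with no group inside."""
    out = []
    for node in tree:
        if isinstance(node, str):
            out.append(node)
        elif not any(isinstance(child, list) for child in node):
            out.append('[' + ''.join(node) + ']')
    return ''.join(out)


print(render_flat(parse_brackets("[a] b [c [d] e] f [g]")))
# [a] b  f [g]
```

Separating the two steps means a different policy (for example, keeping the top-level text of nested groups) only requires a different filter function, not a different parser.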

I took a couple of passes at writing a single parser expression that could be used with expression.transformString(), but I had difficulty distinguishing between nested and unnested []'s at parse time. In the end I had to open up the loop in transformString and iterate over the scanString generator explicitly.

The original question leaves open whether [some] should be included or not. To explore this, I added more "unnested" text at the end, using this string:

src = """[This] is some text with [some [blocks that are 
    nested [in a [variety] of ways]] in various places]"""

My first parser follows the original question's lead, and rejects any bracketed expression that has any nesting. My second pass takes the top level tokens of any bracketed expression, and returns them in brackets - I didn't like this solution so well, as we lose the information that "some" and "in various places" are not contiguous. So I took one last pass, and had to make a slight change to the default behavior of nestedExpr. See the code below:

from pyparsing import nestedExpr, ParseResults, CharsNotIn

# 1. scan the source string for nested [] exprs, and take only those that
# do not themselves contain [] exprs
out = []
last = 0
for tokens,start,end in nestedExpr("[","]").scanString(src):
    out.append(src[last:start])
    if not any(isinstance(tok,ParseResults) for tok in tokens[0]):
        out.append(src[start:end])
    last = end
out.append(src[last:])
print("".join(out))


# 2. scan the source string for nested [] exprs, and take only the toplevel 
# tokens from each
out = []
last = 0
for t,s,e in nestedExpr("[","]").scanString(src):
    out.append(src[last:s])
    topLevel = [tok for tok in t[0] if not isinstance(tok,ParseResults)]
    out.append('['+" ".join(topLevel)+']')
    last = e
out.append(src[last:])
print("".join(out))


# 3. scan the source string for nested [] exprs, and take only the toplevel 
# tokens from each, keeping each group separate
out = []
last = 0
for t,s,e in nestedExpr("[","]", CharsNotIn('[]')).scanString(src):
    out.append(src[last:s])
    for tok in t[0]:
        if isinstance(tok,ParseResults): continue
        out.append('['+tok.strip()+']')
    last = e
out.append(src[last:])
print("".join(out))

Giving:

[This] is some text with 
[This] is some text with [some in various places]
[This] is some text with [some][in various places]

I hope one of these comes close to the OP's question. But if nothing else, I got to explore nestedExpr's behavior a little further.
