
Using regex to replace object within brackets in a text file

I have an opened text file, f. I need to find every instance of square brackets enclosing text, inclusive of the brackets. For example, with:

1 - This is the [first] line
2 - (And) another line
3 - [Finally][B] the last

It would match/print:

1 - [first]
3 - [Finally]
3 - [B]

Once I have printed these matches, I'd like to delete them and normalize any excessive whitespace, so the final text would be:

1 - This is the line
2 - (And) another line
3 - the last

The function would conceptually look like this, though I'm having trouble doing the regex part of it:

def find_and_replace(file):
    f=open(file)
    regex = re.compile("[.+]")
    find regex.all
    for item in regex.all:
        print item, line-number
        replace(item, '')
        normalize white space

Thank you.


You have to escape the [ and ] characters and use a non-greedy quantifier:

r'\[.+?\]'

Note that this pattern cannot handle nested brackets such as [foo [bar]].

Also, to remove extra spaces, add \s? to the end of the regex.

Example:

>>> import re
>>> a = '''1 - This is the [first] line
2 - (And) another line
3 - [Finally][B] the last
'''
>>> a = re.sub(r'\[.+?\]\s?','',a)
>>> print(a)
1 - This is the line
2 - (And) another line
3 - the last
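
For the printing part of the question, here is a minimal sketch in Python 3 (an assumption; the question's pseudocode uses the Python 2 print statement). The name find_and_replace comes from the question, but taking a list of lines instead of a file name is my own choice for illustration, and the "1 - " prefixes are treated as line labels rather than file content.

```python
import re

# Non-greedy match of one bracketed chunk, brackets included.
BRACKETED = re.compile(r'\[.+?\]')

def find_and_replace(lines):
    """Print each bracketed chunk with its line number; return cleaned lines."""
    cleaned = []
    for number, line in enumerate(lines, 1):
        for chunk in BRACKETED.findall(line):
            print(f'{number} - {chunk}')
        without_chunks = BRACKETED.sub('', line)
        # Collapse runs of whitespace left behind by the removal.
        cleaned.append(' '.join(without_chunks.split()))
    return cleaned

sample = [
    'This is the [first] line',
    '(And) another line',
    '[Finally][B] the last',
]
result = find_and_replace(sample)
print(result)
```

Keep in mind that ' '.join(line.split()) collapses every run of whitespace, leading indentation included, so it only fits if that is acceptable for your file.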


Using JBernardo's regex, the following displays the line and its number each time a bracketed chunk is removed:

import re

ss = '''When colour goes [xxxx] home into the eyes,
And lights that shine are shut again,
With danc[yyy]ing girls and sweet birds' cries
Behind the gateways[ZZZZ  ] of the brain;
And that no-place which gave them birth, shall close
The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'''

print ss,'\n'

dico_lines = dict( (n,repr(line)) for n,line in enumerate(ss.splitlines(True),1))

def repl(mat, countline =[1]):
    if mat.group(1):
        print "line %s: detecting \\n , the counter of lines is incremented -> %s" % (countline[0],countline[0]+1)
        countline[0] += 1
        return mat.group(1)
    else:
        print "line %s: removing %10s  in  %s" % (countline[0],repr(mat.group()),dico_lines[countline[0]])
        return ''

print '\n'+re.sub(r'(\n)|\[.*?\] ?',repl,ss)

results in

When colour goes [xxxx] home into the eyes,
And lights that shine are shut again,
With danc[yyy]ing girls and sweet birds' cries
Behind the gateways[ZZZZ  ] of the brain;
And that no-place which gave them birth, shall close
The [AAA]rainbow [UUUUU] and [BBBB]the rose:— 

line 1: removing  '[xxxx] '  in  'When colour goes [xxxx] home into the eyes,\n'
line 1: detecting \n , the counter of lines is incremented -> 2
line 2: detecting \n , the counter of lines is incremented -> 3
line 3: removing    '[yyy]'  in  "With danc[yyy]ing girls and sweet birds' cries\n"
line 3: detecting \n , the counter of lines is incremented -> 4
line 4: removing '[ZZZZ  ] '  in  'Behind the gateways[ZZZZ  ] of the brain;\n'
line 4: detecting \n , the counter of lines is incremented -> 5
line 5: detecting \n , the counter of lines is incremented -> 6
line 6: removing    '[AAA]'  in  'The [AAA]rainbow [UUUUU] and [BBBB]the rose:\x97'
line 6: removing '[UUUUU] '  in  'The [AAA]rainbow [UUUUU] and [BBBB]the rose:\x97'
line 6: removing   '[BBBB]'  in  'The [AAA]rainbow [UUUUU] and [BBBB]the rose:\x97'

When colour goes home into the eyes,
And lights that shine are shut again,
With dancing girls and sweet birds' cries
Behind the gatewaysof the brain;
And that no-place which gave them birth, shall close
The rainbow and the rose:—

But as JBernardo pointed out, this regex runs into problems if the string contains nested brackets:

ss = 'one [two [three] ] end of line'
print re.sub(r'\[.+?\]\s?','',ss)

produces

one ] end of line

If the pattern is changed so that a match cannot contain brackets, only the innermost bracketed chunks are removed:

ss = 'one [two [three] ] end of line'
print re.sub(r'\[[^\][]*\]\s?','',ss)

gives

one [two ] end of line


So I looked for solutions to various subcases, in case you want to handle nested bracketed chunks as well.
Since regexes are not parsers, a bracketed chunk that itself contains bracketed chunks can't be removed in one pass; we have to iterate, removing the innermost chunks at each pass until the whole multi-level nest is gone.
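
The iteration idea can be sketched in a few lines of Python 3 (an assumption; the code in this answer is Python 2). The names INNERMOST and strip_nested are hypothetical, and the sketch ignores whitespace handling, which the subcases below deal with.

```python
import re

# A chunk whose content holds no brackets, i.e. an innermost chunk.
INNERMOST = re.compile(r'\[[^\[\]]*\]')

def strip_nested(text):
    """Remove innermost chunks repeatedly until a pass removes nothing."""
    while True:
        text, count = INNERMOST.subn('', text)
        if count == 0:
            return text

nested = strip_nested('one [two [three] ] end of line')
unbalanced = strip_nested('    fgjezhr][fgh')
print(repr(nested))
print(repr(unbalanced))
```

re.subn returns the number of replacements made, which gives a natural stopping condition. Unbalanced brackets such as "][" never match and are left alone.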


Subcase 1

Simple removal of nested bracketed chunks:

import re

ss = '''This is the [first]       line   
(And) another line
   [Inter][A] initially shifted
[Finally][B] the last
    Additional ending lines (this one without brackets):    
[Note that [ by the way [ref [ 1]] there are]    [some] other ]cases
tuvulu[]gusti perena[3]              bdiiii
    [Away [is this] [][4] ] shifted content
    fgjezhr][fgh
'''

def clean(x, regx = re.compile('( |(?<! ))+((?<!])\[[^[\]]*\])( *)')):
    while regx.search(x):
        print '------------\n',x,'\n','\n'.join(map(str,regx.findall(x)))
        x = regx.sub('\\1',x)
    return x


print '\n==========================\n'+clean(ss)

Only the final result is shown here. Run the code if you want to follow the intermediate steps.

This is the line   
(And) another line
 initially shifted
the last
    Additional ending lines (this one without brackets):    
cases
tuvulugusti perenabdiiii
 shifted content
    fgjezhr][fgh

Notice that one leading blank remains where a line began with blanks followed by a bracketed chunk. The two original lines

   [Inter][A] initially shifted
    [Away [is this] [][4] ] shifted content

are transformed into

 initially shifted
 shifted content

Subcase 2

So I improved the regex and the algorithm to remove all the leading blanks at the beginning of such lines.

def clean(x, regx = re.compile('(?=^( ))?( |(?<! ))+((?<!])\[[^[\]]*\])( )*',re.MULTILINE)):
    def repl(mat):
        return '' if mat.group(1) else mat.group(2)
    while regx.search(x):
        print '------------\n',x,'\n','\n'.join(map(str,regx.findall(x)))
        x = regx.sub(repl,x)
    return x


print '\n==========================\n'+clean(ss)

Result:

This is the line   
(And) another line
initially shifted
the last
    Additional ending lines (this one without brackets):    
cases
tuvulugusti perenabdiiii
shifted content
    fgjezhr][fgh

Lines that begin with blanks but contain no removed bracketed chunk remain unmodified. If you want to eliminate leading blanks in such lines too, it is better to call strip() on every line; in that case this solution is unnecessary and the former one is sufficient.
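
As a rough sketch of that strip-every-line alternative, in Python 3 (an assumption; the code in this answer is Python 2): the name remove_and_strip is hypothetical, and unlike the regexes above it also collapses runs of blanks inside each line, which may or may not be what you want.

```python
import re

# A chunk whose content holds no brackets, i.e. an innermost chunk.
INNERMOST = re.compile(r'\[[^\[\]]*\]')

def remove_and_strip(text):
    # Repeat until a pass changes nothing, so nested chunks disappear too.
    previous = None
    while previous != text:
        previous = text
        text = INNERMOST.sub('', text)
    # Collapse inner whitespace runs and strip both ends of every line.
    return '\n'.join(' '.join(line.split()) for line in text.splitlines())

sample = ('   [Inter][A] initially shifted\n'
          '    [Away [is this] [][4] ] shifted content\n'
          '    fgjezhr][fgh')
cleaned = remove_and_strip(sample)
print(cleaned)
```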

Subcase 3

To also display the lines in which a removal is performed, the code must account for the fact that we now iterate:

  • the lines change on each pass of the iteration, so a constant dico_lines can no longer be used

  • moreover, the line counter must be reset to 1 at the start of each pass

To get both, I use a bit of a trick: reassigning the func_defaults attribute of the replacement function (Python 2; the attribute is named __defaults__ in Python 3).

import re

ss = '''This is the [first]       line   
(And) another line
   [Inter][A] initially shifted
[Finally][B] the last
    Additional ending lines (this one without brackets):    
[Note that [ by the way [ref [ 1]] there are]    [some] other ]cases
tuvulu[]gusti perena[3]              bdiiii
    [Away [is this] [][4] ] shifted content
    fgjezhr][fgh
'''

def clean(x, rag = re.compile('\[.*\]',re.MULTILINE),
          regx = re.compile('(\n)|(?=^( ))?( |(?<! ))+((?<!])\[[^[\]\n]*\])( *)',re.MULTILINE)):

    def repl(mat, cnt = None, dico_lignes = None):
        if mat.group(1):
            print "line %s: detecting %s  ==> count incremented to %s" % (cnt[0],str(mat.groups('')),cnt[0]+1)
            cnt[0] += 1
            return mat.group(1)
        if mat.group(4):
            print "line %s: removing %s   IN   %s" % (cnt[0],repr(mat.group(4)),dico_lignes[cnt[0]])
            return '' if mat.group(2) else mat.group(3)

    while rag.search(x):
        print '\n--------------------------\n'+x
        repl.func_defaults = ([1],dict( (n,repr(line)) for n,line in enumerate(x.splitlines(True),1)))
        x = regx.sub(repl,x)
    return x


print '\n==========================\n'+clean(ss)

Result:

--------------------------
This is the [first]       line   
(And) another line
   [Inter][A] initially shifted
[Finally][B] the last
    Additional ending lines (this one without brackets):    
[Note that [ by the way [ref [ 1]] there are]    [some] other ]cases
tuvulu[]gusti perena[3]              bdiiii
    [Away [is this] [][4] ] shifted content
    fgjezhr][fgh

line 1: removing '[first]'   IN   'This is the [first]       line   \n'
line 1: detecting ('\n', '', '', '', '')  ==> count incremented to 2
line 2: detecting ('\n', '', '', '', '')  ==> count incremented to 3
line 3: removing '[Inter]'   IN   '   [Inter][A] initially shifted\n'
line 3: detecting ('\n', '', '', '', '')  ==> count incremented to 4
line 4: removing '[Finally]'   IN   '[Finally][B] the last\n'
line 4: detecting ('\n', '', '', '', '')  ==> count incremented to 5
line 5: detecting ('\n', '', '', '', '')  ==> count incremented to 6
line 6: removing '[ 1]'   IN   '[Note that [ by the way [ref [ 1]] there are]    [some] other ]cases\n'
line 6: removing '[some]'   IN   '[Note that [ by the way [ref [ 1]] there are]    [some] other ]cases\n'
line 6: detecting ('\n', '', '', '', '')  ==> count incremented to 7
line 7: removing '[]'   IN   'tuvulu[]gusti perena[3]              bdiiii\n'
line 7: removing '[3]'   IN   'tuvulu[]gusti perena[3]              bdiiii\n'
line 7: detecting ('\n', '', '', '', '')  ==> count incremented to 8
line 8: removing '[is this]'   IN   '    [Away [is this] [][4] ] shifted content\n'
line 8: detecting ('\n', '', '', '', '')  ==> count incremented to 9
line 9: detecting ('\n', '', '', '', '')  ==> count incremented to 10

--------------------------
This is the line   
(And) another line
[A] initially shifted
[B] the last
    Additional ending lines (this one without brackets):    
[Note that [ by the way [ref ] there are] other ]cases
tuvulugusti perenabdiiii
    [Away [][4] ] shifted content
    fgjezhr][fgh

line 1: detecting ('\n', '', '', '', '')  ==> count incremented to 2
line 2: detecting ('\n', '', '', '', '')  ==> count incremented to 3
line 3: removing '[A]'   IN   '[A] initially shifted\n'
line 3: detecting ('\n', '', '', '', '')  ==> count incremented to 4
line 4: removing '[B]'   IN   '[B] the last\n'
line 4: detecting ('\n', '', '', '', '')  ==> count incremented to 5
line 5: detecting ('\n', '', '', '', '')  ==> count incremented to 6
line 6: removing '[ref ]'   IN   '[Note that [ by the way [ref ] there are] other ]cases\n'
line 6: detecting ('\n', '', '', '', '')  ==> count incremented to 7
line 7: detecting ('\n', '', '', '', '')  ==> count incremented to 8
line 8: removing '[]'   IN   '    [Away [][4] ] shifted content\n'
line 8: detecting ('\n', '', '', '', '')  ==> count incremented to 9
line 9: detecting ('\n', '', '', '', '')  ==> count incremented to 10

--------------------------
This is the line   
(And) another line
initially shifted
the last
    Additional ending lines (this one without brackets):    
[Note that [ by the way there are] other ]cases
tuvulugusti perenabdiiii
    [Away [4] ] shifted content
    fgjezhr][fgh

line 1: detecting ('\n', '', '', '', '')  ==> count incremented to 2
line 2: detecting ('\n', '', '', '', '')  ==> count incremented to 3
line 3: detecting ('\n', '', '', '', '')  ==> count incremented to 4
line 4: detecting ('\n', '', '', '', '')  ==> count incremented to 5
line 5: detecting ('\n', '', '', '', '')  ==> count incremented to 6
line 6: removing '[ by the way there are]'   IN   '[Note that [ by the way there are] other ]cases\n'
line 6: detecting ('\n', '', '', '', '')  ==> count incremented to 7
line 7: detecting ('\n', '', '', '', '')  ==> count incremented to 8
line 8: removing '[4]'   IN   '    [Away [4] ] shifted content\n'
line 8: detecting ('\n', '', '', '', '')  ==> count incremented to 9
line 9: detecting ('\n', '', '', '', '')  ==> count incremented to 10

--------------------------
This is the line   
(And) another line
initially shifted
the last
    Additional ending lines (this one without brackets):    
[Note that other ]cases
tuvulugusti perenabdiiii
    [Away ] shifted content
    fgjezhr][fgh

line 1: detecting ('\n', '', '', '', '')  ==> count incremented to 2
line 2: detecting ('\n', '', '', '', '')  ==> count incremented to 3
line 3: detecting ('\n', '', '', '', '')  ==> count incremented to 4
line 4: detecting ('\n', '', '', '', '')  ==> count incremented to 5
line 5: detecting ('\n', '', '', '', '')  ==> count incremented to 6
line 6: removing '[Note that other ]'   IN   '[Note that other ]cases\n'
line 6: detecting ('\n', '', '', '', '')  ==> count incremented to 7
line 7: detecting ('\n', '', '', '', '')  ==> count incremented to 8
line 8: removing '[Away ]'   IN   '    [Away ] shifted content\n'
line 8: detecting ('\n', '', '', '', '')  ==> count incremented to 9
line 9: detecting ('\n', '', '', '', '')  ==> count incremented to 10

==========================
This is the line   
(And) another line
initially shifted
the last
    Additional ending lines (this one without brackets):    
cases
tuvulugusti perenabdiiii
shifted content
    fgjezhr][fgh
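
If the func_defaults trick feels too fragile, the line number of a match can also be derived from its position, by counting the newlines that precede it. The following is a minimal Python 3 sketch of that idea (an assumption; it is not the code above, the name clean_with_report is hypothetical, and it does no whitespace cleanup).

```python
import re

# Innermost chunk that does not span lines, so newlines are never removed
# and line numbers stay valid from one pass to the next.
INNERMOST = re.compile(r'\[[^\[\]\n]*\]')

def clean_with_report(text):
    """Return (cleaned_text, report); report holds (line_number, chunk) pairs."""
    report = []
    while True:
        matches = list(INNERMOST.finditer(text))
        if not matches:
            return text, report
        for match in matches:
            # Line number = newlines before the match start, plus one.
            line_no = text.count('\n', 0, match.start()) + 1
            report.append((line_no, match.group()))
        text = INNERMOST.sub('', text)

sample = 'a [x [y]] b\nplain\n[z] c'
cleaned, report = clean_with_report(sample)
for line_no, chunk in report:
    print(f'line {line_no}: removing {chunk!r}')
```

No state has to be carried between calls of a replacement function here, at the cost of scanning the text once more per pass.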


The regex:

re.findall(r'\[[^\]]+\]', 'foo [bar] baz')

yields:

['[bar]']

so:

re.compile(r'\[[^\]]+\]')

should work for you.


On the regex front, "[.+]" creates a character class that matches a single . or +. You need to escape the [ and ] characters, since they have special meaning in regular expressions. Additionally, because quantifiers are greedy by default, the escaped pattern "\\[.+\\]" would match all of [a] foo [b] as a single match. Add a ? after the + to tell it to match the shortest sequence of characters possible.

So try "\\[.+?\\]" and see if that works.

If you want to find and remove [] as well, then replace the + quantifier with *.
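
To make the differences concrete, here is a small Python 3 comparison of the four patterns discussed (the sample strings are made up for illustration):

```python
import re

text = '[a] foo [b] and []'

# Unescaped: a character class matching a single '.' or '+'.
char_class = re.findall('[.+]', 'a.b+c')
# Escaped but greedy: runs from the first '[' to the last ']'.
greedy = re.findall(r'\[.+\]', text)
# Escaped and non-greedy: one match per non-empty bracketed chunk.
lazy = re.findall(r'\[.+?\]', text)
# With * instead of +, the empty [] is matched as well.
lazy_star = re.findall(r'\[.*?\]', text)

print(char_class)
print(greedy)
print(lazy)
print(lazy_star)
```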
