Using regex to replace object within brackets in a text file
I have an opened text file, f. I need to find every instance of square brackets enclosing text, inclusive of the brackets. For example, given:
1 - This is the [first] line
2 - (And) another line
3 - [Finally][B] the last
It would match/print:
1 - [first]
3 - [Finally]
3 - [B]
Once I have printed these matches, I'd like to delete them and normalize any excessive whitespace, so the final text would be:
1 - This is the line
2 - (And) another line
3 - the last
The function would conceptually look like this, though I'm having trouble doing the regex part of it:
def find_and_replace(file):
    f = open(file)
    regex = re.compile("[.+]")
    find regex.all
    for item in regex.all:
        print item, line-number
        replace(item, '')
    normalize white space
Thank you.
You have to escape the [] chars and use a non-greedy operator:
r'\[.+?\]'
Note that you won't be able to handle nested brackets like [foo [bar]] using regex.
Also, to remove extra spaces, add \s? to the end of the regex.
Example:
>>> a = '''1 - This is the [first] line
2 - (And) another line
3 - [Finally][B] the last
'''
>>> a = re.sub(r'\[.+?\]\s?','',a)
>>> print(a)
1 - This is the line
2 - (And) another line
3 - the last
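To also print each match with its line number before deleting it, as the question asks, one possible sketch processes the text line by line. The name `strip_bracketed` and the choice to replace each chunk with a space and then collapse runs of spaces are assumptions made for illustration, not part of the answer above; the syntax is Python 3.

```python
import re

BRACKETED = re.compile(r'\[.+?\]')   # non-greedy, as in the answer above
SPACES = re.compile(r' {2,}')        # runs of two or more spaces

def strip_bracketed(text):
    """Return (found, cleaned); found is a list of (line_number, chunk)."""
    found = []
    cleaned = []
    for number, line in enumerate(text.splitlines(), 1):
        for match in BRACKETED.finditer(line):
            found.append((number, match.group()))
        # Replace each chunk with a space, then collapse the extra spaces.
        line = SPACES.sub(' ', BRACKETED.sub(' ', line)).strip()
        cleaned.append(line)
    return found, '\n'.join(cleaned)

sample = ("1 - This is the [first] line\n"
          "2 - (And) another line\n"
          "3 - [Finally][B] the last")
found, cleaned = strip_bracketed(sample)
for number, chunk in found:
    print(number, chunk)
print(cleaned)
```

Replacing with a space keeps neighbouring words apart, but it would also split a word that has a chunk in the middle of it; substituting '' instead avoids that at the cost of sometimes gluing words together.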
Using JBernardo's regex, the following displays the line and its number each time a bracketed chunk is removed:
import re
ss = '''When colour goes [xxxx] home into the eyes,
And lights that shine are shut again,
With danc[yyy]ing girls and sweet birds' cries
Behind the gateways[ZZZZ ] of the brain;
And that no-place which gave them birth, shall close
The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'''
print ss,'\n'
dico_lines = dict( (n,repr(line)) for n,line in enumerate(ss.splitlines(True),1))
def repl(mat, countline =[1]):
    if mat.group(1):
        print "line %s: detecting \\n , the counter of lines is incremented -> %s" % (countline[0],countline[0]+1)
        countline[0] += 1
        return mat.group(1)
    else:
        print "line %s: removing %10s in %s" % (countline[0],repr(mat.group()),dico_lines[countline[0]])
        return ''
print '\n'+re.sub(r'(\n)|\[.*?\] ?',repl,ss)
results in
When colour goes [xxxx] home into the eyes,
And lights that shine are shut again,
With danc[yyy]ing girls and sweet birds' cries
Behind the gateways[ZZZZ ] of the brain;
And that no-place which gave them birth, shall close
The [AAA]rainbow [UUUUU] and [BBBB]the rose:—
line 1: removing '[xxxx] ' in 'When colour goes [xxxx] home into the eyes,\n'
line 1: detecting \n , the counter of lines is incremented -> 2
line 2: detecting \n , the counter of lines is incremented -> 3
line 3: removing '[yyy]' in "With danc[yyy]ing girls and sweet birds' cries\n"
line 3: detecting \n , the counter of lines is incremented -> 4
line 4: removing '[ZZZZ ] ' in 'Behind the gateways[ZZZZ ] of the brain;\n'
line 4: detecting \n , the counter of lines is incremented -> 5
line 5: detecting \n , the counter of lines is incremented -> 6
line 6: removing '[AAA]' in 'The [AAA]rainbow [UUUUU] and [BBBB]the rose:\x97'
line 6: removing '[UUUUU] ' in 'The [AAA]rainbow [UUUUU] and [BBBB]the rose:\x97'
line 6: removing '[BBBB]' in 'The [AAA]rainbow [UUUUU] and [BBBB]the rose:\x97'
When colour goes home into the eyes,
And lights that shine are shut again,
With dancing girls and sweet birds' cries
Behind the gatewaysof the brain;
And that no-place which gave them birth, shall close
The rainbow and the rose:—
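The line number can also be derived from the position of each match, instead of matching the newlines and counting them inside the replacement function. The sketch below is an alternative under that assumption, not the code above; `report_and_remove` is a hypothetical name and the syntax is Python 3.

```python
import re

def report_and_remove(text, pattern=r'\[.*?\] ?'):
    """Print each bracketed chunk with its 1-based line number, then remove it."""
    lines = text.splitlines()

    def repl(match):
        # The number of newlines before the match gives the zero-based line index.
        index = text.count('\n', 0, match.start())
        print("line %s: removing %r in %r" % (index + 1, match.group(), lines[index]))
        return ''

    return re.sub(pattern, repl, text)

poem = "When colour goes [xxxx] home\nWith danc[yyy]ing girls"
result = report_and_remove(poem)
print(result)
```

Counting newlines from the start of the text for every match is simple but repeats work on long inputs; the counter approach above makes a single pass.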
But, as JBernardo pointed out, this regex runs into problems if the string contains nested brackets:
ss = 'one [two [three] ] end of line'
print re.sub(r'\[.+?\]\s?','',ss)
produces
one ] end of line
If the pattern is modified so that a chunk cannot itself contain brackets, only the innermost bracketed chunks are removed:
ss = 'one [two [three] ] end of line'
print re.sub(r'\[[^\][]*\]\s?','',ss)
gives
one [two ] end of line
So I looked for solutions to various subcases, in case you want to handle all the nested bracketed chunks as well.
Since regexes are not parsers, a bracketed chunk that contains nested bracketed chunks can't be removed in one pass: we have to iterate, progressively removing the bracketed chunks level by level.
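In its simplest form, the iteration just described can be sketched as a loop that removes the innermost chunks until a pass changes nothing. This is only an illustration of the idea (the name `remove_nested` is hypothetical, the syntax is Python 3) and it ignores the whitespace handling treated in the subcases below.

```python
import re

INNERMOST = re.compile(r'\[[^\][]*\]')  # a chunk that contains no other bracket

def remove_nested(text):
    """Strip bracketed chunks level by level, innermost first."""
    while True:
        text, count = INNERMOST.subn('', text)
        if count == 0:      # nothing was removed on this pass: done
            return text

result = remove_nested('one [two [three] ] end of line')
print(repr(result))
```

Unbalanced brackets such as `][` never match the pattern, so they are left in place and the loop still terminates.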
Subcase 1
Simple removing of nested bracketed chunks:
import re
ss = '''This is the [first] line
(And) another line
[Inter][A] initially shifted
[Finally][B] the last
Additional ending lines (this one without brackets):
[Note that [ by the way [ref [ 1]] there are] [some] other ]cases
tuvulu[]gusti perena[3] bdiiii
[Away [is this] [][4] ] shifted content
fgjezhr][fgh
'''
def clean(x, regx = re.compile('( |(?<! ))+((?<!])\[[^[\]]*\])( *)')):
    while regx.search(x):
        print '------------\n',x,'\n','\n'.join(map(str,regx.findall(x)))
        x = regx.sub('\\1',x)
    return x
print '\n==========================\n'+clean(ss)
I give only the final result. Run the code if you want to follow the intermediate steps.
This is the line
(And) another line
initially shifted
the last
Additional ending lines (this one without brackets):
cases
tuvulugusti perenabdiiii
shifted content
fgjezhr][fgh
Notice that a leading blank remains on the two lines that originally began with a blank:
[Inter][A] initially shifted
[Away [is this] [][4] ] shifted content
are transformed into
initially shifted
shifted content
Subcase 2
So I improved the regex and the algorithm to remove ALL the leading blanks at the beginning of such lines.
def clean(x, regx = re.compile('(?=^( ))?( |(?<! ))+((?<!])\[[^[\]]*\])( )*',re.MULTILINE)):
    def repl(mat):
        return '' if mat.group(1) else mat.group(2)
    while regx.search(x):
        print '------------\n',x,'\n','\n'.join(map(str,regx.findall(x)))
        x = regx.sub(repl,x)
    return x
print '\n==========================\n'+clean(ss)
result
This is the line
(And) another line
initially shifted
the last
Additional ending lines (this one without brackets):
cases
tuvulugusti perenabdiiii
shifted content
fgjezhr][fgh
Lines that begin with blanks but contain no bracketed chunks remain unmodified. If you want to eliminate the leading blanks in such lines too, you'd do better to call strip() on every line; then you wouldn't need this solution, and the former one would be sufficient.
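The strip() suggestion could look like the following sketch (Python 3 syntax; `strip_lines` is a hypothetical helper, not part of the code above). Using lstrip(' ') rather than strip() is an assumption made here so that only leading blanks are dropped and the line endings are kept.

```python
def strip_lines(text):
    """Remove leading blanks from every line, keeping the line endings."""
    return ''.join(line.lstrip(' ') for line in text.splitlines(True))

cleaned = strip_lines(' initially shifted\nthe last\n  shifted content\n')
print(cleaned)
```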
Subcase 3
To also display the lines in which a removal is performed, the code must now take into account that we are iterating:
- the lines change progressively at each pass of the iteration, so we can't use a constant dico_lines;
- at each pass of the iteration, the line counter must be reset to 1.
To obtain these two adaptations, I use a kind of trick: modifying the func_defaults of the replacement function.
import re
ss = '''This is the [first] line
(And) another line
[Inter][A] initially shifted
[Finally][B] the last
Additional ending lines (this one without brackets):
[Note that [ by the way [ref [ 1]] there are] [some] other ]cases
tuvulu[]gusti perena[3] bdiiii
[Away [is this] [][4] ] shifted content
fgjezhr][fgh
'''
def clean(x, rag = re.compile('\[.*\]',re.MULTILINE),
          regx = re.compile('(\n)|(?=^( ))?( |(?<! ))+((?<!])\[[^[\]\n]*\])( *)',re.MULTILINE)):
    def repl(mat, cnt = None, dico_lignes = None):
        if mat.group(1):
            # a newline was matched: keep it and move the counter to the next line
            print "line %s: detecting %s ==> count incremented to %s" % (cnt[0],str(mat.groups('')),cnt[0]+1)
            cnt[0] += 1
            return mat.group(1)
        if mat.group(4):
            print "line %s: removing %s IN %s" % (cnt[0],repr(mat.group(4)),dico_lignes[cnt[0]])
            return '' if mat.group(2) else mat.group(3)
    while rag.search(x):
        print '\n--------------------------\n'+x
        # reset the counter and rebuild the line table for this pass
        # (func_defaults is the Python 2 name; Python 3 calls it __defaults__)
        repl.func_defaults = ([1],dict( (n,repr(line)) for n,line in enumerate(x.splitlines(True),1)))
        x = regx.sub(repl,x)
    return x
print '\n==========================\n'+clean(ss)
result
--------------------------
This is the [first] line
(And) another line
[Inter][A] initially shifted
[Finally][B] the last
Additional ending lines (this one without brackets):
[Note that [ by the way [ref [ 1]] there are] [some] other ]cases
tuvulu[]gusti perena[3] bdiiii
[Away [is this] [][4] ] shifted content
fgjezhr][fgh
line 1: removing '[first]' IN 'This is the [first] line \n'
line 1: detecting ('\n', '', '', '', '') ==> count incremented to 2
line 2: detecting ('\n', '', '', '', '') ==> count incremented to 3
line 3: removing '[Inter]' IN ' [Inter][A] initially shifted\n'
line 3: detecting ('\n', '', '', '', '') ==> count incremented to 4
line 4: removing '[Finally]' IN '[Finally][B] the last\n'
line 4: detecting ('\n', '', '', '', '') ==> count incremented to 5
line 5: detecting ('\n', '', '', '', '') ==> count incremented to 6
line 6: removing '[ 1]' IN '[Note that [ by the way [ref [ 1]] there are] [some] other ]cases\n'
line 6: removing '[some]' IN '[Note that [ by the way [ref [ 1]] there are] [some] other ]cases\n'
line 6: detecting ('\n', '', '', '', '') ==> count incremented to 7
line 7: removing '[]' IN 'tuvulu[]gusti perena[3] bdiiii\n'
line 7: removing '[3]' IN 'tuvulu[]gusti perena[3] bdiiii\n'
line 7: detecting ('\n', '', '', '', '') ==> count incremented to 8
line 8: removing '[is this]' IN ' [Away [is this] [][4] ] shifted content\n'
line 8: detecting ('\n', '', '', '', '') ==> count incremented to 9
line 9: detecting ('\n', '', '', '', '') ==> count incremented to 10
--------------------------
This is the line
(And) another line
[A] initially shifted
[B] the last
Additional ending lines (this one without brackets):
[Note that [ by the way [ref ] there are] other ]cases
tuvulugusti perenabdiiii
[Away [][4] ] shifted content
fgjezhr][fgh
line 1: detecting ('\n', '', '', '', '') ==> count incremented to 2
line 2: detecting ('\n', '', '', '', '') ==> count incremented to 3
line 3: removing '[A]' IN '[A] initially shifted\n'
line 3: detecting ('\n', '', '', '', '') ==> count incremented to 4
line 4: removing '[B]' IN '[B] the last\n'
line 4: detecting ('\n', '', '', '', '') ==> count incremented to 5
line 5: detecting ('\n', '', '', '', '') ==> count incremented to 6
line 6: removing '[ref ]' IN '[Note that [ by the way [ref ] there are] other ]cases\n'
line 6: detecting ('\n', '', '', '', '') ==> count incremented to 7
line 7: detecting ('\n', '', '', '', '') ==> count incremented to 8
line 8: removing '[]' IN ' [Away [][4] ] shifted content\n'
line 8: detecting ('\n', '', '', '', '') ==> count incremented to 9
line 9: detecting ('\n', '', '', '', '') ==> count incremented to 10
--------------------------
This is the line
(And) another line
initially shifted
the last
Additional ending lines (this one without brackets):
[Note that [ by the way there are] other ]cases
tuvulugusti perenabdiiii
[Away [4] ] shifted content
fgjezhr][fgh
line 1: detecting ('\n', '', '', '', '') ==> count incremented to 2
line 2: detecting ('\n', '', '', '', '') ==> count incremented to 3
line 3: detecting ('\n', '', '', '', '') ==> count incremented to 4
line 4: detecting ('\n', '', '', '', '') ==> count incremented to 5
line 5: detecting ('\n', '', '', '', '') ==> count incremented to 6
line 6: removing '[ by the way there are]' IN '[Note that [ by the way there are] other ]cases\n'
line 6: detecting ('\n', '', '', '', '') ==> count incremented to 7
line 7: detecting ('\n', '', '', '', '') ==> count incremented to 8
line 8: removing '[4]' IN ' [Away [4] ] shifted content\n'
line 8: detecting ('\n', '', '', '', '') ==> count incremented to 9
line 9: detecting ('\n', '', '', '', '') ==> count incremented to 10
--------------------------
This is the line
(And) another line
initially shifted
the last
Additional ending lines (this one without brackets):
[Note that other ]cases
tuvulugusti perenabdiiii
[Away ] shifted content
fgjezhr][fgh
line 1: detecting ('\n', '', '', '', '') ==> count incremented to 2
line 2: detecting ('\n', '', '', '', '') ==> count incremented to 3
line 3: detecting ('\n', '', '', '', '') ==> count incremented to 4
line 4: detecting ('\n', '', '', '', '') ==> count incremented to 5
line 5: detecting ('\n', '', '', '', '') ==> count incremented to 6
line 6: removing '[Note that other ]' IN '[Note that other ]cases\n'
line 6: detecting ('\n', '', '', '', '') ==> count incremented to 7
line 7: detecting ('\n', '', '', '', '') ==> count incremented to 8
line 8: removing '[Away ]' IN ' [Away ] shifted content\n'
line 8: detecting ('\n', '', '', '', '') ==> count incremented to 9
line 9: detecting ('\n', '', '', '', '') ==> count incremented to 10
==========================
This is the line
(And) another line
initially shifted
the last
Additional ending lines (this one without brackets):
cases
tuvulugusti perenabdiiii
shifted content
fgjezhr][fgh
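Rewriting func_defaults is specific to Python 2. A sketch of one alternative, assuming the same bookkeeping is wanted, builds a fresh replacement function for each pass so that the counter and the line table are ordinary closure variables. The names `make_repl` and `clean_verbose` are hypothetical, the pattern is simplified and leaves out the whitespace handling, and the syntax is Python 3.

```python
import re

CHUNK = r'\[[^\][\n]*\]'                    # an innermost chunk on one line
PATTERN = re.compile(r'(\n)|' + CHUNK)      # a newline, or such a chunk

def make_repl(text, log):
    """Build a replacement function with its own line counter for one pass."""
    lines = text.splitlines(True)
    line_no = 1

    def repl(match):
        nonlocal line_no
        if match.group(1):                  # newline: keep it, go to next line
            line_no += 1
            return match.group(1)
        log.append((line_no, match.group(), lines[line_no - 1]))
        return ''

    return repl

def clean_verbose(text):
    """Remove nested chunks pass by pass; return the text and a removal log."""
    log = []
    while re.search(CHUNK, text):
        text = PATTERN.sub(make_repl(text, log), text)
    return text, log

cleaned, log = clean_verbose('a [b [c]] d\n[e] f\n')
for number, chunk, line in log:
    print("line %s: removing %r IN %r" % (number, chunk, line))
print(cleaned)
```

Because the loop condition uses the same chunk pattern as the substitution, every pass removes at least one chunk and the loop stops on text that only contains unbalanced brackets.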
The regex:
re.findall(r'\[[^\]]+\]', 'foo [bar] baz')
yields:
['[bar]']
so:
re.compile(r'\[[^\]]+\]')
should work for you.
On the regex front, "[.+]" is going to create a character class that will match a . or a +. You need to escape the [ and ] characters since they have special meaning in regular expressions. Additionally, once escaped, the pattern will still match a whole string like [a] foo [b] as a single match, since quantifiers are greedy by default. Add a ? after the + to tell it to match the shortest sequence of characters possible.
So try "\\[.+?\\]" and see if that works.
If you want to find and remove [] as well, then replace the + quantifier with *.
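The differences between these patterns can be checked with a short sketch (Python 3 syntax; the sample strings and variable names are made up for illustration):

```python
import re

text = '[a] foo [b] []'

# Unescaped: a character class that matches a literal '.' or '+'.
char_class = re.findall(r'[.+]', 'a.b+c')
# Escaped but greedy: a single match from the first '[' to the last ']'.
greedy = re.findall(r'\[.+\]', text)
# Escaped and non-greedy: each non-empty bracketed chunk on its own.
lazy = re.findall(r'\[.+?\]', text)
# With * instead of +, the empty pair '[]' is matched as well.
lazy_star = re.findall(r'\[.*?\]', text)

for name, result in [('char_class', char_class), ('greedy', greedy),
                     ('lazy', lazy), ('lazy_star', lazy_star)]:
    print(name, result)
```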