Python - Modifying a backreference. Can it be done?
New to Python so please forgive my ignorance. I'm trying to modify backreferenced strings in a regular expression.
Example:
>>> a_string
'fsa fad fdsa dsafasdf u.s.a. U.S.A. u.s.a fdas adfs.f fdsa f.afda'
>>> re.sub(r'(?<=\s)(([a-zA-Z]\.)+[a-zA-Z]\.{0,1})(?=\s)', '<acronym>'+re.sub(r'\.',r'',(r'\1').upper())+'</acronym>', a_string)
'fsa fad fdsa dsafasdf <acronym>u.s.a.</acronym> <acronym>U.S.A.</acronym> <acronym>u.s.a</acronym> fdas adfs.f fdsa f.afda'
Instead of the output I desire:
'fsa fad fdsa dsafasdf <acronym>USA</acronym> <acronym>USA</acronym> <acronym>USA</acronym> fdas adfs.f fdsa f.afda'
Thanks for your help.
From the docs:
If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example:
And see the example contained in the linked docs.
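As a rough illustration of that passage (not the example from the docs themselves), here is a minimal sketch. The helper name shout and the sample text are made up for this illustration; the only library call used is re.sub with a callable as its second argument.

```python
import re

def shout(match):
    # Called once per non-overlapping match; the return value
    # becomes the replacement text for that match.
    return match.group(0).upper()

# Upper-case every standalone three-letter lowercase word.
result = re.sub(r'\b[a-z]{3}\b', shout, 'one two three')
print(result)  # ONE TWO three
```

The function only ever sees one match at a time, so any per-match logic can go inside it.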
As Ignacio Vazquez-Abrams suggested, you can solve your problem by passing a callable to re.sub(). I figured that sample code would explain it best, so here you go:
import re
s = "fsa fad fdsa dsafasdf u.s.a. U.S.A. u.s.a fdas adfs.f fdsa f.afda"
s_pat = r'(?<=\s)(([a-zA-Z]\.)+[a-zA-Z]\.{0,1})(?=\s)'
pat = re.compile(s_pat)
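def add_acronym_tag(match_object):
    # group(0) is the whole matched acronym, e.g. 'u.s.a.'
    s = match_object.group(0)
    # drop the dots and upper-case what is left
    s = s.replace('.', '').upper()
    return "<acronym>%s</acronym>" % s
# re.sub calls add_acronym_tag once per match
s = re.sub(pat, add_acronym_tag, s)
print(s)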
The above prints:
fsa fad fdsa dsafasdf <acronym>USA</acronym> <acronym>USA</acronym> <acronym>USA</acronym> fdas adfs.f fdsa f.afda
So you aren't actually modifying the backreference, because strings are immutable. But this is just as good: you can write a function to do any processing you want and return whatever you want, and that is what re.sub() will insert in the final result.
Note that you can use regular expressions inside your function; I just used the .replace() string method because you only want to get rid of a single character, and you don't really need the full power of regular expressions for that.
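To show what using a regular expression inside the function might look like, here is a sketch that swaps the .replace() call for a nested re.sub(). The names ACRONYM, text and tagged are chosen for this illustration, and the input string is shortened.

```python
import re

ACRONYM = re.compile(r'(?<=\s)(([a-zA-Z]\.)+[a-zA-Z]\.{0,1})(?=\s)')

def add_acronym_tag(match_object):
    # Strip the dots with a nested re.sub instead of str.replace.
    letters = re.sub(r'\.', '', match_object.group(0))
    return '<acronym>%s</acronym>' % letters.upper()

text = 'fsa u.s.a. U.S.A. u.s.a fdas adfs.f fdsa'
tagged = ACRONYM.sub(add_acronym_tag, text)
print(tagged)
```

For a single literal character the string method is still the simpler choice; the nested call pays off only when the cleanup itself needs a pattern.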
The phrase "modifying a backreference" needs rephrasing, as it conflates two notions.
A replacement backreference is a special combination of characters inside a string that tells the regex engine to refer to a specific capturing group value (aka submatch) retrieved during a match operation.
When you use r'\1'.upper(), you are trying to uppercase the string \1 itself, and since \1 has no letters to uppercase, you get \1 back. This unchanged \1 is then used as (part of) the replacement pattern.
That is why you can't modify the capturing group value this way.
That is why you have to use a callable as the replacement argument (see Ignacio's answer): re.sub passes the match object to the callable, which lets you manipulate the submatches. (You may of course replace a character or two in the backreference itself, say r'\g<12>'.replace('2','1') to get a \g<11> backreference, but there is little sense in that operation.)
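The evaluation order can be checked directly. The sketch below rebuilds the replacement expression from the question on its own, before any matching takes place; the variable names repl and out, and the short test pattern, are only for this illustration.

```python
import re

# The replacement argument is an ordinary Python expression, so it is
# fully evaluated before the outer re.sub ever looks at the input.
repl = '<acronym>' + re.sub(r'\.', r'', (r'\1').upper()) + '</acronym>'
print(repl)                        # <acronym>\1</acronym>
print(r'\1'.upper() == r'\1')      # True: nothing to uppercase

# Only now is \1 expanded, using the unmodified group value.
out = re.sub(r'(\w\.\w)', repl, 'a.b')
print(out)                         # <acronym>a.b</acronym>
```

By the time the group value exists, .upper() and the inner re.sub() have already run on the two-character string \1.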
Context
- python 3.x
- using re.sub to do regular expression substitution
- apply arbitrary modification on part of a string, without modifying the whole string
Problem scenario
- UserMattL/uu002matt1675257544 wants to match part of a string with regex
- user wants to modify the matched part of the string
Solution
- The general solution to this scenario is already posted elsewhere in this thread
- This answer shows a simpler example that does basically the same thing, but using a Python lambda instead of declaring a standalone function
Example
- user has a string matching a MSFT Windows filepath specification
- user wants to change the drive letter to lowercase, but without modifying any other part of the string
- regex is appropriate here, since the drive letter can be any character, making str.replace() impractical
import re
ss7676demotest = 'D:/AlphaOne/BravoTwo'
rx7676demotest = re.compile(r'^(\w):')
ss7676demotest = re.sub(rx7676demotest, lambda obmatch: '{vjj}:'.format(vjj=obmatch.group(1).lower()), ss7676demotest,)
print(ss7676demotest) ## d:/AlphaOne/BravoTwo
Rationale
- a Python lambda allows the user to minimize the amount of code
Pitfalls
- the question talks about modifying a backreference, but that's actually not what's going on here
- backreference is a regex concept for specifying subregions of a string that match a specific part of a regex
- in this context, we want to specify one of those matched subregions so we can modify it without affecting any other part of the string
- we do that with the match object that re.sub passes to the callable (e.g., obmatch.group(1) in this example)
- a Python lambda is practical in simple scenarios that do not involve a lot of logic, but it is not always the best choice
- for more elaborate transformations, it is preferable to write a standalone function instead of using a lambda
  - to enhance readability
  - to make the code more maintainable
- see the other answer in this thread for the alternate non-lambda approach
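For comparison, the drive-letter example could be written with a standalone function roughly as follows. This is a sketch, and the names DRIVE_PREFIX, lowercase_drive, path and fixed are invented for it.

```python
import re

DRIVE_PREFIX = re.compile(r'^(\w):')

def lowercase_drive(match):
    # group(1) is the captured drive letter; the colon is part of the
    # match, so it has to be added back to the replacement by hand.
    return match.group(1).lower() + ':'

path = 'D:/AlphaOne/BravoTwo'
fixed = DRIVE_PREFIX.sub(lowercase_drive, path)
print(fixed)  # d:/AlphaOne/BravoTwo
```

The named function gives the transformation a place for comments and extra logic, at the cost of a few more lines than the lambda.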