Python - Modifying a backreference. Can it be done?

2023-04-01 05:33

New to Python so please forgive my ignorance. I'm trying to modify backreferenced strings in a regular expression.

Example:

>>> a_string
'fsa fad fdsa dsafasdf u.s.a. U.S.A. u.s.a fdas adfs.f fdsa f.afda'
>>> re.sub(r'(?<=\s)(([a-zA-Z]\.)+[a-zA-Z]\.{0,1})(?=\s)', '<acronym>'+re.sub(r'\.',r'',(r'\1').upper())+'</acronym>', a_string)
'fsa fad fdsa dsafasdf <acronym>u.s.a.</acronym> <acronym>U.S.A.</acronym> <acronym>u.s.a</acronym> fdas adfs.f fdsa f.afda'

Instead of the output I desire:

'fsa fad fdsa dsafasdf <acronym>USA</acronym> <acronym>USA</acronym> <acronym>USA</acronym> fdas adfs.f fdsa f.afda'

Thanks for your help.

From the docs:

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example:

And see the example contained in the linked docs.

As Ignacio Vazquez-Abrams suggested, you can solve your problems by passing a callable function to re.sub(). I figured that sample code would explain it best, so here you go:

import re

s = "fsa fad fdsa dsafasdf u.s.a. U.S.A. u.s.a fdas adfs.f fdsa f.afda"

s_pat = r'(?<=\s)(([a-zA-Z]\.)+[a-zA-Z]\.{0,1})(?=\s)'
pat = re.compile(s_pat)

def add_acronym_tag(match_object):
    # group(0) is the whole matched text, e.g. 'u.s.a.'
    acronym = match_object.group(0)
    acronym = acronym.replace('.', '').upper()
    return "<acronym>%s</acronym>" % acronym

# re.sub() calls add_acronym_tag once per match and inserts its return value
s = re.sub(pat, add_acronym_tag, s)
print(s)

The above prints:

fsa fad fdsa dsafasdf <acronym>USA</acronym> <acronym>USA</acronym> <acronym>USA</acronym> fdas adfs.f fdsa f.afda

So you aren't actually modifying the backreference, because strings are immutable. But this is just as good: you can write a function to do any processing you want, and then return whatever you want, and that is what re.sub() will insert in the final result.

Note that you can use regular expressions inside your function; I just used the .replace() string method because you just want to get rid of a single character, and you don't really need the full power of regular expressions for that.
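As a minimal sketch of using a regular expression inside the replacement function, the variant below strips the dots with a nested re.sub() call instead of .replace(). The name add_acronym_tag_re is made up for this illustration; the pattern and sample string are the ones from the code above.

```python
import re

# Pattern and sample string taken from the answer above.
pat = re.compile(r'(?<=\s)(([a-zA-Z]\.)+[a-zA-Z]\.{0,1})(?=\s)')
s = "fsa fad fdsa dsafasdf u.s.a. U.S.A. u.s.a fdas adfs.f fdsa f.afda"

def add_acronym_tag_re(match_object):
    # Hypothetical variant: a nested re.sub() removes the dots
    # from the matched text before it is uppercased.
    text = re.sub(r'\.', '', match_object.group(0)).upper()
    return "<acronym>%s</acronym>" % text

result = pat.sub(add_acronym_tag_re, s)
print(result)
```

For a single literal character the .replace() version is simpler. A nested regex is mostly useful when the cleanup inside the match is itself pattern-based.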

The phrase "modifying a backreference" needs rephrasing, because it mixes up two different notions.

A replacement backreference is a special combination of characters inside a string that tells the regex engine to refer to the value of a specific capturing group (a submatch) retrieved during a match operation.

When you use r'\1'.upper(), you are uppercasing the two-character string \1 itself. It contains no letters to uppercase, so the result is still \1, and this unchanged \1 is then used as part of the replacement pattern.

That is why you can't modify the capturing group value this way.
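The order of evaluation can be checked directly. The sketch below uses a shortened sample string (an assumption made to keep the output small) together with the pattern from the question, and shows that the replacement string is fully built before re.sub() is called.

```python
import re

a_string = 'fsa u.s.a. fdas'  # shortened sample, for illustration only

# Python evaluates this expression first; re.sub() has not run yet,
# so \1 is just a backslash followed by the digit 1.
repl = '<acronym>' + re.sub(r'\.', r'', r'\1'.upper()) + '</acronym>'
print(repl)  # <acronym>\1</acronym>

# upper() finds no letters to change in the string \1.
assert r'\1'.upper() == r'\1'

# re.sub() only sees the finished string, so \1 expands to the raw submatch.
pattern = r'(?<=\s)(([a-zA-Z]\.)+[a-zA-Z]\.{0,1})(?=\s)'
out = re.sub(pattern, repl, a_string)
print(out)  # fsa <acronym>u.s.a.</acronym> fdas
```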

That is why you have to use a callable as the replacement argument (see Ignacio's answer): re.sub passes the match object to the callable, which lets you manipulate the submatches. (You can of course replace a character or two in the backreference text itself, say r'\g<12>'.replace('2','1') to "obfuscate" a \g<11> backreference, but there is little sense in doing so.)

Context

python 3.x
using re.sub to do regular expression substitution
apply arbitrary modification on part of a string, without modifying the whole string

Problem scenario

UserMattL/uu002matt1675257544 wants to match part of a string with regex
user wants to modify the matched part of the string

Solution

The general solution to this scenario is already posted elsewhere in this thread
This answer shows a simpler example that does basically the same thing, but using python lambda instead of declaring a standalone function

Example

user has a string matching a MSFT Windows filepath specification
user wants to change the drive letter to lowercase, but without modifying any other part of the string
regex is appropriate here, since the drive letter can be any character, making str.replace() impractical

  import re
  ss7676demotest = 'D:/AlphaOne/BravoTwo'
  rx7676demotest = re.compile(r'^(\w):')  ## capture the drive letter before the colon
  ss7676demotest = re.sub(
      rx7676demotest,
      lambda obmatch: '{vjj}:'.format(vjj=obmatch.group(1).lower()),
      ss7676demotest,
  )
  print(ss7676demotest) ## d:/AlphaOne/BravoTwo

Rationale

python lambda allows user to minimize the amount of code

Pitfalls

the question talks about modifying a backreference, but that's actually not what's going on here
- backreference is a regex concept for specifying subregions of a string that match a specific part of a regex
- in this context, we want to specify one of those matched subregions so we can modify it without affecting any other part of the string
- we do that with the match object that re.sub() passes to the callable (eg, obmatch.group(1) in this example)
python lambda is practical in simple scenarios that do not involve a lot of logic, but not always favorable
- for more elaborate transformations, it is favorable to write a standalone function instead of using a lambda
- to enhance readability
- to make the code more maintainable
- see the other answer in this thread for the alternate non-lambda approach
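For comparison, the same drive-letter transformation can be sketched with a standalone function in place of the lambda. The names DRIVE_RX and lowercase_drive are hypothetical and chosen only for this illustration; the pattern is the one from the example above.

```python
import re

# Hypothetical name; same pattern as in the lambda example.
DRIVE_RX = re.compile(r'^(\w):')

def lowercase_drive(match):
    # Hypothetical helper: rebuild the matched text with the
    # captured drive letter lowercased, keeping the colon.
    return match.group(1).lower() + ':'

path = 'D:/AlphaOne/BravoTwo'
result = DRIVE_RX.sub(lowercase_drive, path)
print(result)  # d:/AlphaOne/BravoTwo
```

A named function costs a few extra lines but gives the transformation a name and a place for comments, which tends to help once the logic grows beyond a single expression.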
