Arithmetic operations in regex

Posted 2023-03-23 09:25

I am using the gedit regex plugin (Python-style regex). I would like to do an arithmetic operation on a backreference to a group.

For example:

PART 1 DATA MODELS Chapter  
2 Entity-Relationship Model 27

I would like to change it to be

PART 1 DATA MODELS Chapter  25
2 Entity-Relationship Model 27

My regex is ^(PART.*)\n(.*\s(\d+))\n, and I would like to replace it with something like \1 (\3-2)\n\2\n, where \3-2 is meant to be the backreference \3 minus 2. But this replacement does not work. How can I do this? Thanks!

You can pass re.sub a function (for example a lambda) that takes the match object for each non-overlapping match of the pattern and returns the replacement string. For example:

import re
# The lambda receives each match and returns the sum of its two groups as text.
print(re.sub(r"(\d+)\+(\d+)",
             lambda m: str(int(m.group(1)) + int(m.group(2))),
             "If 2+2 is 4 then 1+2+3+4 is 10"))

prints

If 4 is 4 then 3+7 is 10

You could easily apply it to your problem.
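Applied to the example from the question, a minimal sketch might look like this. It assumes the offset is always the fixed value 2 from the question, and the names text, PATTERN and add_page are made up for illustration.

```python
import re

# Sample input from the question; the first line ends with "Chapter".
text = "PART 1 DATA MODELS Chapter\n2 Entity-Relationship Model 27\n"

# The question's pattern, with re.MULTILINE so ^ matches at each line start.
PATTERN = re.compile(r"^(PART.*)\n(.*\s(\d+))\n", re.MULTILINE)

def add_page(m):
    # Group 3 is the page number at the end of the second line (as text).
    page = int(m.group(3)) - 2
    return "{} {}\n{}\n".format(m.group(1), page, m.group(2))

result = PATTERN.sub(add_page, text)
print(result)
```

The function does the arithmetic that the replacement string cannot: it converts group 3 to an int, subtracts, and rebuilds both lines.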

The following code does what you want on the example string you gave. Note that it is very specific to this format: it expects a line starting with PART, followed by a line that begins with a number and ends with the page number, and it cannot cope with any variation.

import re

ss = '''PART 1 DATA MODELS Chapter
2 Entity-Relationship Model 27

The sun is shining

PART 1 DATA MODELS Chapter
13 Entity-Relationship Model 45
'''

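regx = re.compile(r'^(PART.*)(\n(\d*).*\s(\d+)\n)', re.MULTILINE)

def repl(mat):
    # group(1): the PART line; group(2): the following line, newlines included
    # group(3): leading number of that line; group(4): its trailing page number
    return ''.join((mat.group(1), ' ',
                    str(int(mat.group(4)) - int(mat.group(3))),
                    mat.group(2)))

for mat in regx.finditer(ss):
    print(mat.groups())

print()

print(regx.sub(repl, ss))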

result

('PART 1 DATA MODELS Chapter', '\n2 Entity-Relationship Model 27\n', '2', '27')
('PART 1 DATA MODELS Chapter', '\n13 Entity-Relationship Model 45\n', '13', '45')

PART 1 DATA MODELS Chapter 25
2 Entity-Relationship Model 27

The sun is shining

PART 1 DATA MODELS Chapter 32
13 Entity-Relationship Model 45

Edited: I had forgotten the re.MULTILINE flag

I'm not aware that you can do arithmetic or other computations in regexes. If there's a regex engine out there that supports that, it would be really nifty! But my understanding is that wouldn't be practical without hugely slowing down the regex engine.

I think your best bet would be to use the sub regex function/method:

re.sub(pattern, repl, string[, count, flags])
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \j are left alone (since Python 3.7, unknown escapes of ASCII letters raise an error instead). Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern. For example:
>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'
If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example:
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'
The pattern may be a string or an RE object.

The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced. Empty matches for the pattern are replaced only when not adjacent to a previous match, so sub('x*', '-', 'abc') returns '-a-b-c-'.

In addition to character escapes and backreferences as described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
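The \g forms described in the quoted documentation can be tried out with a short sketch like the one below. The sample string "Model 27" and the variable names are made up for illustration.

```python
import re

# Named group: (?P<page>...) in the pattern, \g<page> in the replacement.
named = re.sub(r"Model (?P<page>\d+)", r"Model p.\g<page>", "Model 27")

# \g<2>0 means "group 2, then a literal 0"; a bare \20 would mean group 20.
padded = re.sub(r"(\w+) (\d+)", r"\1 \g<2>0", "Model 27")

# \g<0> stands for the whole match.
wrapped = re.sub(r"\d+", r"[\g<0>]", "Model 27")

print(named, padded, wrapped, sep="\n")
```

None of these forms can compute anything; they only copy matched text, which is why the arithmetic has to go into a repl function.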

You can pass repl as a function that calculates the values to substitute back into the original string.

Unless gedit is a superset of Python, it won't allow operations inside a replacement string as you're trying to do with (\3-2). In any case, \3 is a string, and you'd need to convert it with int() first. So you'd have to break it into separate steps: match with re.search(...), compute the page number to insert, then insert it.
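Those separate steps could look roughly like this. It is only a sketch for a single match, assuming the fixed offset of 2 from the question; the variable names are illustrative.

```python
import re

text = "PART 1 DATA MODELS Chapter\n2 Entity-Relationship Model 27\n"

# Step 1: find the PART line and the line after it.
m = re.search(r"^(PART.*)\n(.*\s(\d+))\n", text, re.MULTILINE)
if m:
    # Step 2: group 3 is text, so convert it before subtracting.
    page = int(m.group(3)) - 2
    # Step 3: insert the number at the end of the PART line (end of group 1).
    cut = m.end(1)
    text = text[:cut] + " " + str(page) + text[cut:]

print(text)
```

This handles only the first occurrence; for every occurrence you would loop with re.finditer or use re.sub with a function.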

The second issue is that you did not match the page length of '2'; you hardcoded it. Did you want your regex to match it from the start of the second line?

(Also in any case your multiline match will only match one line following the PART, if that's what you intended.)

Here it is implemented in plain Python regex:

import re
for m in re.finditer(r'^(PART.*)\n(.*\s+(\d+))\n', text, re.M):  # text: the input string
    chap, sect, page = m.groups()  # finditer yields match objects, not tuples
    print(chap, int(page) - 2)
    print(sect)

(I tried to wrap that as a repl function, paginate_chapter(matchobj), but can't get re.sub to call it reliably yet...)
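The paginate_chapter repl function mentioned above could be sketched roughly as follows. The body is a guess at what was intended, not the answerer's code. One common pitfall, which may or may not be what went wrong here, is passing re.M as the fourth positional argument of re.sub: that position is count, so the flag has to be passed as flags=re.M.

```python
import re

def paginate_chapter(matchobj):
    # Groups: 1 = PART line, 2 = following line, 3 = trailing page number.
    chap, sect, page = matchobj.groups()
    return "{} {}\n{}\n".format(chap, int(page) - 2, sect)

# Sample text reused from the earlier answer in this thread.
text = (
    "PART 1 DATA MODELS Chapter\n2 Entity-Relationship Model 27\n"
    "\nThe sun is shining\n\n"
    "PART 1 DATA MODELS Chapter\n13 Entity-Relationship Model 45\n"
)

# flags must be a keyword argument; the fourth positional argument is count.
out = re.sub(r"^(PART.*)\n(.*\s+(\d+))\n", paginate_chapter, text, flags=re.M)
print(out)
```

With the offset hardcoded to 2, the second block gets 43 rather than the 32 produced by the answer that subtracts the leading number of the second line.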
