Arithmetic operations in regex

I am using the gedit regex plugin (Python-style regex). I would like to do some arithmetic operation on a backreference to a group.

For example:

PART 1 DATA MODELS Chapter  
2 Entity-Relationship Model 27

I would like to change it to be

PART 1 DATA MODELS Chapter  25
2 Entity-Relationship Model 27

My regex is ^(PART.*)\n(.*\s(\d+))\n, and I would like to replace it with something like \1 (\3-2)\n\2\n where \3-2 is meant to be the backreference \3 minus 2. But this replacement string does not work. How can I do it? Thanks!


You can pass re.sub a function (for example a lambda) as the replacement. It is called with the match object for every non-overlapping match of the pattern and returns the replacement string. For example:

import re
print(re.sub(r"(\d+)\+(\d+)",
             lambda m: str(int(m.group(1)) + int(m.group(2))),
             "If 2+2 is 4 then 1+2+3+4 is 10"))

prints

If 4 is 4 then 3+7 is 10

You could easily apply it to your problem.
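As a rough sketch of how that might look for the question's case: the pattern below is the one from the question, while the sample string (without trailing spaces) and the helper name add_page are assumptions made for illustration.

```python
import re

# Sample text modelled on the question (trailing spaces omitted for clarity).
text = "PART 1 DATA MODELS Chapter\n2 Entity-Relationship Model 27\n"

# The pattern from the question; group 3 is the trailing page number.
pattern = re.compile(r"^(PART.*)\n(.*\s(\d+))\n", re.MULTILINE)

def add_page(m):
    # Hypothetical helper: subtract the fixed offset 2, as "\3-2" intended.
    return "%s %d\n%s\n" % (m.group(1), int(m.group(3)) - 2, m.group(2))

result = pattern.sub(add_page, text)
print(result)
```

The arithmetic happens in ordinary Python inside the callback, so the regex itself stays unchanged.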


The following code does what you want on the string you gave as an example. Note that it is very specific to the format of this one string: it cannot cope with any variation in that format.

import re

ss = '''PART 1 DATA MODELS Chapter
2 Entity-Relationship Model 27

The sun is shining

PART 1 DATA MODELS Chapter
13 Entity-Relationship Model 45
'''

regx = re.compile(r'^(PART.*)(\n(\d*).*\s(\d+)\n)', re.MULTILINE)

def repl(mat):
    return ''.join((mat.group(1),' ',
                    str(int(mat.group(4))-int(mat.group(3))),
                    mat.group(2)))

for mat in regx.finditer(ss):
    print(mat.groups())

print()

print(regx.sub(repl, ss))

result

('PART 1 DATA MODELS Chapter', '\n2 Entity-Relationship Model 27\n', '2', '27')
('PART 1 DATA MODELS Chapter', '\n13 Entity-Relationship Model 45\n', '13', '45')

PART 1 DATA MODELS Chapter 25
2 Entity-Relationship Model 27

The sun is shining

PART 1 DATA MODELS Chapter 32
13 Entity-Relationship Model 45

Edited: I had forgotten the re.MULTILINE flag
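If some tolerance is wanted, one possible variation (only a sketch, not part of the original answer) uses named groups and absorbs trailing blanks at the end of each line. The group names part, line, first and page and the sample string are assumptions chosen for illustration.

```python
import re

# Sample with trailing spaces after "Chapter", as in the question.
sample = (
    "PART 1 DATA MODELS Chapter  \n"
    "2 Entity-Relationship Model 27\n"
    "\n"
    "The sun is shining\n"
)

# Named groups say what each piece is; [ \t]* swallows trailing blanks.
# \d+ (instead of \d*) skips lines that do not start with a number.
regx = re.compile(
    r"^(?P<part>PART.*?)[ \t]*\n"
    r"(?P<line>(?P<first>\d+).*\s(?P<page>\d+))[ \t]*\n",
    re.MULTILINE,
)

def repl(mat):
    # Same arithmetic as above: last number minus leading number.
    diff = int(mat.group("page")) - int(mat.group("first"))
    return "%s %d\n%s\n" % (mat.group("part"), diff, mat.group("line"))

fixed = regx.sub(repl, sample)
print(fixed)
```

It is still tied to the two-line PART layout; it only relaxes the whitespace handling.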


I'm not aware that you can do arithmetic or other computations in regexes. If there's a regex engine out there that supports that, it would be really nifty! But my understanding is that wouldn't be practical without hugely slowing down the regex engine.

I think your best bet would be to use the sub regex function/method:

re.sub(pattern, repl, string[, count, flags])

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \j are left alone (since Python 3.7, an unknown escape made of a backslash and an ASCII letter is an error instead). Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern. For example:

>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example:

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'

The pattern may be a string or an RE object.

The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced. Empty matches for the pattern are replaced only when not adjacent to a previous match, so sub('x*', '-', 'abc') returns '-a-b-c-'.
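A quick check of the count argument and the empty-match rule, as a small sketch; the string 'a-b-c' is made up for illustration, and 'abc' is the example from the paragraph above.

```python
import re

# count=1 limits the substitution to the first match only.
first_only = re.sub(r"-", " ", "a-b-c", count=1)

# x* matches the empty string around every character of 'abc'.
empty_matches = re.sub(r"x*", "-", "abc")

print(first_only)
print(empty_matches)
```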

In addition to character escapes and backreferences as described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
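To make the \g<...> forms concrete, here is a short sketch; the patterns, the group name word and the input strings are invented for illustration.

```python
import re

# \g<1>0 means "group 1, then a literal 0"; r"\10" would ask for group 10.
numbered = re.sub(r"(\d)-(\d)", r"\g<1>0-\g<2>0", "1-2")

# A group named with (?P<word>...) is referenced as \g<word>.
named = re.sub(r"(?P<word>\w+)@", r"\g<word> at ", "user@host")

print(numbered)
print(named)
```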

You can pass repl as a function that calculates the values to substitute back into the original string.


Unless gedit's plugin supports more than plain Python regex syntax, it won't allow arithmetic inside a replacement string as you're trying to do with (\3-2). In any case, \3 is a string, so you'd need to convert it with int() first. So you'd have to break the job into separate steps: re.search(...), compute the page number to insert, then insert it.
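Those three steps might be sketched like this in plain Python, outside gedit. The pattern is the one from the question; the sample string and the choice to splice the number in after group 1 are assumptions for illustration.

```python
import re

text = "PART 1 DATA MODELS Chapter\n2 Entity-Relationship Model 27\n"

# Step 1: search for the two-line block.
m = re.search(r"^(PART.*)\n(.*\s(\d+))\n", text, re.MULTILINE)
if m:
    # Step 2: group 3 is a string, so convert it before subtracting.
    page = int(m.group(3)) - 2
    # Step 3: splice the computed number in right after group 1.
    text = text[:m.end(1)] + " " + str(page) + text[m.end(1):]

print(text)
```

This handles only the first occurrence; a loop or re.sub with a function would be needed for several.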

The second issue is that you did not match the page length of '2'; you hardcoded it. Did you want your regex to capture it from the start of the second line?

(Also in any case your multiline match will only match one line following the PART, if that's what you intended.)

Here it is implemented in plain Python regex:

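import re

for m in re.finditer(r'^(PART.*)\n(.*\s+(\d+))\n', text, re.M):
    chap, sect, page = m.groups()  # finditer yields match objects, not tuples
    print(chap, int(page) - 2)     # text is the string being searched
    print(sect)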

(I tried to wrap that as a repl fn paginate_chapter(matchobj), can't get re.sub to call that reliably yet...)
