"Counting" with multiple regular expressions matches in Python

2023-02-10 12:27

Let's say I have the following multi-line string:

# Section
## Subsection
## Subsection
# Section
## Subsection
### Subsubsection
### Subsubsection
# Section
## Subsection

and I want it to become:

# 1 Section
## 1.1 Subsection
## 1.2 Subsection
# 2 Section
## 2.1 Subsection
### 2.1.1 Subsubsection
### 2.1.2 Subsubsection
# 3 Section
## 3.1 Subsection

In Python, using the re module, is it possible to run a substitution on the string which would:

Match the beginning of each line based on the number of #'s
Keep track of past matches of commonly-numbered groups of #'s
Insert this counter when appropriate into the line

...assuming that any of these 'counters' are always non-zero?

This problem is testing the limits of my regex knowledge. I already know I can just iterate over the lines and increment/insert some variables, so that's not the solution I want. I'm simply curious whether this kind of functionality exists solely within regular expressions, as I know that some sort of counting already exists (e.g., number of substitutions to make).

« Ok, sure, but what if the 'variable manipulation' is being done in a callback function of re.sub, can it be done then? I guess a simplified form of my question is: "Can one use regular expressions to substitute differently based on previous matches?" »

It sounds like we need a generator function as a callback; unfortunately, re.sub() doesn't accept a generator function as a callback.

So we must use some trick:

import re

pat = re.compile('^(#+)',re.MULTILINE)

ch = '''# Section
## Subsection
## Subsection
# Section
## Subsection
### Subsubsection
### Subsubsection
## Subsection
### Subsubsection
### Subsubsection
#### Sub4section
#### Sub4section
#### Sub4section
#### Sub4section
##### Sub5section
#### Sub4section
##### Sub5section
##### Sub5section
### Subsubsection
### Subsubsection
#### Sub4section
#### Sub4section
## Subsection
### Subsubsection
### Subsubsection
### Subsubsection
#### Sub4section
##### Sub5section
##### Sub5section
### Subsubsection
#### Sub4section
## Subsection
### Subsubsection
### Subsubsection
# Section
## Subsection
## Subsection
# Section
## Subsection
### Subsubsection
#### Sub4section
#### Sub4section
#### Sub4section
##### Sub5section
#### Sub4section
### Subsubsection
## Subsection
### Subsubsection
# Section
## Subsection
'''

def cbk(match, nb = [0] ):
    if len(match.group())==len(nb):
        nb[-1] += 1
    elif  len(match.group())>len(nb):
        nb.append(1)
    else:
        nb[:] = nb[0:len(match.group())]
        nb[-1] += 1
    return match.group()+' '+('.'.join(map(str,nb)))

ch = pat.sub(cbk,ch)
print(ch)

« Default parameter values are evaluated when the function definition is executed. This means that the expression is evaluated once, when the function is defined, and that that same “pre-computed” value is used for each call. This is especially important to understand when a default parameter is a mutable object, such as a list or a dictionary: if the function modifies the object (e.g. by appending an item to a list), the default value is in effect modified. This is generally not what was intended. »

http://docs.python.org/reference/compound_stmts.html#function

But here, it IS my plain intent.

Result:

# 1 Section
## 1.1 Subsection
## 1.2 Subsection
# 2 Section
## 2.1 Subsection
### 2.1.1 Subsubsection
### 2.1.2 Subsubsection
## 2.2 Subsection
### 2.2.1 Subsubsection
### 2.2.2 Subsubsection
#### 2.2.2.1 Sub4section
#### 2.2.2.2 Sub4section
#### 2.2.2.3 Sub4section
#### 2.2.2.4 Sub4section
##### 2.2.2.4.1 Sub5section
#### 2.2.2.5 Sub4section
##### 2.2.2.5.1 Sub5section
##### 2.2.2.5.2 Sub5section
### 2.2.3 Subsubsection
### 2.2.4 Subsubsection
#### 2.2.4.1 Sub4section
#### 2.2.4.2 Sub4section
## 2.3 Subsection
### 2.3.1 Subsubsection
### 2.3.2 Subsubsection
### 2.3.3 Subsubsection
#### 2.3.3.1 Sub4section
##### 2.3.3.1.1 Sub5section
##### 2.3.3.1.2 Sub5section
### 2.3.4 Subsubsection
#### 2.3.4.1 Sub4section
## 2.4 Subsection
### 2.4.1 Subsubsection
### 2.4.2 Subsubsection
# 3 Section
## 3.1 Subsection
## 3.2 Subsection
# 4 Section
## 4.1 Subsection
### 4.1.1 Subsubsection
#### 4.1.1.1 Sub4section
#### 4.1.1.2 Sub4section
#### 4.1.1.3 Sub4section
##### 4.1.1.3.1 Sub5section
#### 4.1.1.4 Sub4section
### 4.1.2 Subsubsection
## 4.2 Subsection
### 4.2.1 Subsubsection
# 5 Section
## 5.1 Subsection

EDIT 1: I corrected the erroneous line else nb[:] = nb[0:len(match.group())] to just else:

EDIT 2 : the code can be condensed to

def cbk(match, nb = [0] ):
    if len(match.group())>len(nb):
        nb.append(1)
    else:
        nb[:] = nb[0:len(match.group())]
        nb[-1] += 1
    return match.group()+' '+('.'.join(map(str,nb)))
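
One side effect of the mutable default is that the counter survives between calls to pat.sub, so a second run would continue numbering where the first one stopped. A possible variation, sketched here under the assumption that heading levels are never skipped, keeps the state inside a generator and passes its send method as the callback, which comes close to the "generator as callback" idea mentioned at the start of this answer. The names numberer and number_headings are made up for this illustration.

```python
import re

pat = re.compile(r'^(#+)', re.MULTILINE)

def numberer():
    """Generator that receives a match via send() and yields its replacement."""
    nb = []              # one counter per heading level, fresh for every generator
    replacement = None
    while True:
        match = yield replacement
        level = len(match.group())
        if level > len(nb):
            nb.append(1)         # going one level deeper: start a new counter
        else:
            del nb[level:]       # going back up: drop the deeper counters
            nb[-1] += 1
        replacement = match.group() + ' ' + '.'.join(map(str, nb))

def number_headings(text):
    gen = numberer()
    next(gen)                    # advance to the first yield so send() works
    # gen.send is an ordinary callable, so re.sub accepts it as the callback
    return pat.sub(gen.send, text)
```

Because each call to number_headings builds a new generator, two consecutive runs both start counting at 1.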

Regular expressions are for matching strings. They are not for manipulating variables as the matching occurs. You may not like the solution of iterating over each line and counting yourself, but it is a straightforward solution.

Pyparsing packages several of these scan/match/replace tasks up for you into its own parsing framework. Here is an annotated solution to your stated problem:

from pyparsing import LineStart, Word, restOfLine

source = """\
# Section 
## Subsection 
## Subsection 
# Section 
## Subsection #
### Subsubsection 
### Subsubsection 
# Section 
## Subsection 
"""

# define a pyparsing expression to match a header line starting with some 
# number of '#'s (i.e., a "word" composed of '#'s), followed by the rest 
# of the line
sectionHeader = LineStart() + Word("#")("level") + restOfLine

# define a callback to keep track of the nesting and numbering
numberstack = [0]
def insertDottedNumber(tokens):
    level = len(tokens.level)
    if level > len(numberstack):
        numberstack.extend([1]*(level-len(numberstack)))
    else:
        del numberstack[level:]
        numberstack[level-1] += 1

    dottedNum = '.'.join(map(str,numberstack))

    # return the updated string containing the original level and rest
    # of the line, with the dotted number inserted
    return "%s %s %s" % (tokens.level, dottedNum, tokens[1])

# attach parse-time action callback to the sectionHeader expression
sectionHeader.setParseAction(insertDottedNumber)

# use sectionHeader expression to transform the input source string
newsource = sectionHeader.transformString(source)
print(newsource)

Prints the desired output:

# 1  Section 
## 1.1  Subsection 
## 1.2  Subsection 
# 2  Section 
## 2.1  Subsection #
### 2.1.1  Subsubsection 
### 2.1.2  Subsubsection 
# 3  Section 
## 3.1  Subsection

This is not a job for regular expressions alone, but you may be able to use them to make your job easier. For example, this splits your full text into the major sections by using regular expressions:

>>> p = re.compile(r"^# .*\n^(?:^##.*\n)*", re.M)
>>> p.findall(your_text)
['# Section\n## Subsection\n## Subsection\n', '# Section\n## Subsection\n### Subsubsection\n### Subsubsection\n', '# Section\n']

You could conceivably do something recursive with a regular expression like this to further split the subsections, but you are much better off just looping through the lines.
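
For the curious, here is a rough sketch of what the recursive approach might look like. It assumes the text consists only of heading lines, that every heading has a space after its #'s, and that levels are never skipped; anything else is silently dropped. The function name number_recursively is invented for this example.

```python
import re

def number_recursively(text, level=1, prefix=''):
    # One block = a heading with exactly `level` hashes, followed by any
    # number of deeper headings that belong to it.
    block = re.compile(r'^#{%d} .*(?:\n#{%d,} .*)*' % (level, level + 1), re.M)
    out = []
    for i, m in enumerate(block.finditer(text), 1):
        head, _, rest = m.group().partition('\n')
        number = prefix + str(i)
        # re-emit the heading with its dotted number inserted after the hashes
        out.append('#' * level + ' ' + number + head[level:])
        if rest:
            # number the nested headings one level down
            out.append(number_recursively(rest, level + 1, number + '.'))
    return '\n'.join(out)
```

The regex only does the splitting here; the counting still happens in ordinary Python (enumerate), which is part of why a simple loop is the easier route.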

Use that generator trick by eyquem. If not you could always do a find all in global context then rewrite the stuff in a new buffer.
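
The "find all, then rewrite into a new buffer" idea could look roughly like the following in Python. This is only a sketch: the name renumber is hypothetical, and it assumes heading levels are never skipped.

```python
import re

heading = re.compile(r'^#+', re.MULTILINE)

def renumber(text):
    counters = []   # current number at each heading level
    pieces = []     # the new buffer, assembled piece by piece
    last = 0        # end of the text already copied into the buffer
    for m in heading.finditer(text):
        level = len(m.group())
        if level > len(counters):
            counters.append(1)
        else:
            del counters[level:]
            counters[-1] += 1
        pieces.append(text[last:m.end()])                     # text up to and including the hashes
        pieces.append(' ' + '.'.join(map(str, counters)))     # the dotted number
        last = m.end()
    pieces.append(text[last:])                                # whatever follows the last heading
    return ''.join(pieces)
```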

If it's just a one-off thing, this Perl sample does it all...

use strict;
use warnings;

my $data = '
 # 
 ## 
 ## 
 # 
 ## 
 ### 
 ### 
 ###### 
 ##### 
 ####  
 ##### 
 #### 
 ##### 
 ###### 
 ##### 
 ## 
 # 
 ## 
 ';

my @cnts = ();

$data =~ s/^ [^\S\n]* (\#+) [^\S\n]* (.*) $/ callback($1,$2) /xemg;

print $data;

exit(0);

##
 sub callback {
    my ($pounds, $text) = @_;
    my $i = length($pounds) - 1;
    if ($i == 0 || $i <= $#cnts) {
        @cnts[ ($i+1) .. $#cnts ] = (0) x ($#cnts - $i);
        ++$cnts[ $i ];
    }
    else {
        @cnts[ ($#cnts+1) .. $i ] = (1) x ($i - $#cnts);
    }
    my $chapter = $cnts[0];
    for my $ndx (1 .. $i) {
        $chapter .= ".$cnts[ $ndx]";
    }
    return "$pounds \t $chapter $text";
 }

Output:

#        1
##       1.1
##       1.2
#        2
##       2.1
###      2.1.1
###      2.1.2
######   2.1.2.1.1.1
#####    2.1.2.1.2
####     2.1.2.2
#####    2.1.2.2.1
####     2.1.2.3
#####    2.1.2.3.1
######   2.1.2.3.1.1
#####    2.1.2.3.2
##       2.2
#        3
##       3.1
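
For readers following the Python answers, a loose Python port of the Perl callback might look like this. Unlike the earlier Python callback, it pads skipped levels with 1s (so ### directly after # becomes 1.1.1). It is not an exact translation: it truncates the deeper counters instead of zeroing them, and it separates the fields with single spaces rather than a tab. The name make_callback is invented for this sketch.

```python
import re

# same shape as the Perl pattern: optional indent, the hashes, optional blanks, the rest
line_pat = re.compile(r'^[^\S\n]*(#+)[^\S\n]*(.*)$', re.MULTILINE)

def make_callback():
    counts = []   # fresh counters for each callback created
    def callback(match):
        pounds, text = match.group(1), match.group(2)
        level = len(pounds)
        if level <= len(counts):
            del counts[level:]      # forget deeper levels
            counts[-1] += 1
        else:
            # pad every skipped level, and the new one, with 1
            counts.extend([1] * (level - len(counts)))
        return pounds + ' ' + '.'.join(map(str, counts)) + ' ' + text
    return callback
```

It would be used as line_pat.sub(make_callback(), data).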

My, all the helpful people at SO

import re
import textwrap

class DefaultList(list):
    """
    List having a default value (returned on invalid offset)

    >>> t = DefaultList([1,2,3], default=17)
    >>> t[104]
    17
    """
    def __init__(self, *args, **kwargs):
        self.default = kwargs.pop('default', None)
        super(DefaultList,self).__init__(*args, **kwargs)

    def __getitem__(self, y):
        if y >= self.__len__():
            return self.default
        else:
            return super(DefaultList,self).__getitem__(y)

class SectionNumberer(object):
    "Hierarchical document numberer"
    def __init__(self, LineMatcher, Numbertype_list, defaultNumbertype):
        """
        @param LineMatcher:       line matcher instance  (recognize section headings and parse them)
        @param Numbertype_list:   list of Number classes (do section numbering at each level)
        @param defaultNumbertype: default Number class   (if too few Number classes specified)
        """
        super(SectionNumberer,self).__init__()
        self.match   = LineMatcher
        self.types   = DefaultList(Numbertype_list, default=defaultNumbertype)
        self.numbers = []
        self.title   = ''

    def addSection(self, level, title):
        "Add new section"
        depth = len(self.numbers)
        if depth < level:
            for i in range(depth, level):
                self.numbers.append(self.types[i](1))
        else:
            self.numbers = self.numbers[:level]
            self.numbers[-1].inc()

        self.title = title

    def doLine(self, ln):
        "Process section numbering on single-line string"
        match = self.match(ln)
        if match is False:
            return ln
        else:
            self.addSection(*match)
            return str(self)

    def __call__(self, s):
        "Process section numbering on multiline string"
        return '\n'.join(self.doLine(ln) for ln in s.split('\n'))

    def __str__(self):
        "Get label for current section"
        section = '.'.join(str(n) for n in self.numbers)
        return "{0} {1}".format(section, self.title)

class LineMatcher(object):
    "Recognize section headers and parse them"
    def __init__(self, match):
        super(LineMatcher,self).__init__()
        self.match = re.compile(match)

    def __call__(self, line):
        """
        @param line: string

        Expects that self.match is a valid regex expression
        """
        match = re.match(self.match, line)
        if match:
            return len(match.group(1)), match.group(2)
        else:
            return False

# Recognize section headers that look like '### Section_title'
PoundLineMatcher = lambda: LineMatcher(r'([#]+) (.*)')

class Numbertype(object):
    def __init__(self, startAt=0, valueType=int):
        super(Numbertype,self).__init__()
        self.value = valueType(startAt)

    def inc(self):
        self.value += 1

    def __str__(self):
        return str(self.value)

class Roman(int):
    CODING = [
        (1000, 'M'),
        ( 900, 'CM'), ( 500, 'D'), ( 400, 'CD'), ( 100, 'C'),
        (  90, 'XC'), (  50, 'L'), (  40, 'XL'), (  10, 'X'),
        (   9, 'IX'), (   5, 'V'), (   4, 'IV'), (   1, 'I')
    ]

    def __add__(self, y):
        return Roman(int.__add__(self, y))

    def __str__(self):
        value = self.__int__()
        if 0 < value < 4000:
            result = []
            for v,s in Roman.CODING:
                while v <= value:
                    value -= v
                    result.append(s)
            return ''.join(result)
        else:
            raise ValueError("can't generate Roman numeral string for {0}".format(value))

IntNumber = Numbertype
RomanNumber = lambda x=1: Numbertype(x, Roman)

def main():
    test = textwrap.dedent("""
        # Section
        ## Subsection
        ## Subsection
        # Section
        ## Subsection
        ### Subsubsection
        ### Subsubsection
        # Section
        ## Subsection
    """)

    numberer = SectionNumberer(PoundLineMatcher(), [IntNumber, RomanNumber, IntNumber], IntNumber)
    print(numberer(test))

if __name__=="__main__":
    main()

turns

# Section
## Subsection
## Subsection
# Section
## Subsection
### Subsubsection
### Subsubsection
# Section
## Subsection

into

1 Section
1.I Subsection
1.II Subsection
2 Section
2.I Subsection
2.I.1 Subsubsection
2.I.2 Subsubsection
3 Section
3.I Subsection
