"Counting" with multiple regular expressions matches in Python
Let's say I have the following multi-line string:
# Section
## Subsection
## Subsection
# Section
## Subsection
### Subsubsection
### Subsubsection
# Section
## Subsection
and I want it to become:
# 1 Section
## 1.1 Subsection
## 1.2 Subsection
# 2 Section
## 2.1 Subsection
### 2.1.1 Subsubsection
### 2.1.2 Subsubsection
# 3 Section
## 3.1 Subsection
In Python, using the re
module, is it be possible to run a substitution on the string which would:
- Match the beginning of each line based on the number of
#
's - Keep track of past matches of commonly-numbered groups of
#
's - Insert this counter when appropriate into the line
...assuming that any of these 'counters' are always non-zero?
This problem is testing the limits of my regex knowledge. I already know I can just iterate over the lines and increment/insert some variables, so that's 开发者_Python百科not the solution I want. I'm simply curious if this kind of functionality exists solely within a regular expressions, as I know that some sort of counting already exists (e.g., number of substitutions to make).
« Ok, sure, but what if the 'variable manipulation' is being done in a callback function of re.sub, can it be done then? I guess a simplified form of my question is: "Can one use regular expressions to substitue differently based on previous matches?" »
It sounds like we need a generator function as a callback; unfortunately, re.sub() doesn't accept a generator function as a callback.
So we must use some trick:
import re
pat = re.compile('^(#+)',re.MULTILINE)
ch = '''# Section
## Subsection
## Subsection
# Section
## Subsection
### Subsubsection
### Subsubsection
## Subsection
### Subsubsection
### Subsubsection
#### Sub4section
#### Sub4section
#### Sub4section
#### Sub4section
##### Sub5section
#### Sub4section
##### Sub5section
##### Sub5section
### Subsubsection
### Subsubsection
#### Sub4section
#### Sub4section
## Subsection
### Subsubsection
### Subsubsection
### Subsubsection
#### Sub4section
##### Sub5section
##### Sub5section
### Subsubsection
#### Sub4section
## Subsection
### Subsubsection
### Subsubsection
# Section
## Subsection
## Subsection
# Section
## Subsection
### Subsubsection
#### Sub4section
#### Sub4section
#### Sub4section
##### Sub5section
#### Sub4section
### Subsubsection
## Subsection
### Subsubsection
# Section
## Subsection
'''
def cbk(match, nb = [0] ):
if len(match.group())==len(nb):
nb[-1] += 1
elif len(match.group())>len(nb):
nb.append(1)
else:
nb[:] = nb[0:len(match.group())]
nb[-1] += 1
return match.group()+' '+('.'.join(map(str,nb)))
ch = pat.sub(cbk,ch)
print ch
.
« Default parameter values are evaluated when the function definition is executed. This means that the expression is evaluated once, when the function is defined, and that that same “pre-computed” value is used for each call. This is especially important to understand when a default parameter is a mutable object, such as a list or a dictionary: if the function modifies the object (e.g. by appending an item to a list), the default value is in effect modified. This is generally not what was intended. »
http://docs.python.org/reference/compound_stmts.html#function
But here, it IS my plain intent.
Result:
# 1 Section
## 1.1 Subsection
## 1.2 Subsection
# 2 Section
## 2.1 Subsection
### 2.1.1 Subsubsection
### 2.1.2 Subsubsection
## 2.2 Subsection
### 2.2.1 Subsubsection
### 2.2.2 Subsubsection
#### 2.2.2.1 Sub4section
#### 2.2.2.2 Sub4section
#### 2.2.2.3 Sub4section
#### 2.2.2.4 Sub4section
##### 2.2.2.4.1 Sub5section
#### 2.2.2.5 Sub4section
##### 2.2.2.5.1 Sub5section
##### 2.2.2.5.2 Sub5section
### 2.2.3 Subsubsection
### 2.2.4 Subsubsection
#### 2.2.4.1 Sub4section
#### 2.2.4.2 Sub4section
## 2.3 Subsection
### 2.3.1 Subsubsection
### 2.3.2 Subsubsection
### 2.3.3 Subsubsection
#### 2.3.3.1 Sub4section
##### 2.3.3.1.1 Sub5section
##### 2.3.3.1.2 Sub5section
### 2.3.4 Subsubsection
#### 2.3.4.1 Sub4section
## 2.4 Subsection
### 2.4.1 Subsubsection
### 2.4.2 Subsubsection
# 3 Section
## 3.1 Subsection
## 3.2 Subsection
# 4 Section
## 4.1 Subsection
### 4.1.1 Subsubsection
#### 4.1.1.1 Sub4section
#### 4.1.1.2 Sub4section
#### 4.1.1.3 Sub4section
##### 4.1.1.3.1 Sub5section
#### 4.1.1.4 Sub4section
### 4.1.2 Subsubsection
## 4.2 Subsection
### 4.2.1 Subsubsection
# 5 Section
## 5.1 Subsection
EDIT 1 : I corrected else nb[:] = nb[0:len(match.group())] to else: only
EDIT 2 : the code can be condensed to
def cbk(match, nb = [0] ):
if len(match.group())>len(nb):
nb.append(1)
else:
nb[:] = nb[0:len(match.group())]
nb[-1] += 1
return match.group()+' '+('.'.join(map(str,nb)))
Regular expressions are for matching strings. They are not for manipulating variables as the matching occurs. You may not like the solution of iterating over each line and counting yourself, but it is a straightforward solution.
Pyparsing packages several of these scan/match/replace tasks up for you into its own parsing framework. Here is an annotated solution to your stated problem:
from pyparsing import LineStart, Word, restOfLine
source = """\
# Section
## Subsection
## Subsection
# Section
## Subsection #
### Subsubsection
### Subsubsection
# Section
## Subsection
"""
# define a pyparsing expression to match a header line starting with some
# number of '#'s (i.e., a "word" composed of '#'s), followed by the rest
# of the line
sectionHeader = LineStart() + Word("#")("level") + restOfLine
# define a callback to keep track of the nesting and numbering
numberstack = [0]
def insertDottedNumber(tokens):
level = len(tokens.level)
if level > len(numberstack):
numberstack.extend([1]*(level-len(numberstack)))
else:
del numberstack[level:]
numberstack[level-1] += 1
dottedNum = '.'.join(map(str,numberstack))
# return the updated string containing the original level and rest
# of the line, with the dotted number inserted
return "%s %s %s" % (tokens.level, dottedNum, tokens[1])
# attach parse-time action callback to the sectionHeader expression
sectionHeader.setParseAction(insertDottedNumber)
# use sectionHeader expression to transform the input source string
newsource = sectionHeader.transformString(source)
print newsource
Prints the desired:
# 1 Section
## 1.1 Subsection
## 1.2 Subsection
# 2 Section
## 2.1 Subsection #
### 2.1.1 Subsubsection
### 2.1.2 Subsubsection
# 3 Section
## 3.1 Subsection
This is not a job for regular expressions alone, but you may be able to use them to make your job easier. For example, this splits your full text into the major sections by using regular expressions:
>>> p = re.compile(r"^# .*\n^(?:^##.*\n)*", re.M)
>>> p.findall(your_text)
['# Section\n## Subsection\n## Subsection\n', '# Section\n## Subsection\n### Subsubsection\n### Subsubsection\n', '# Section\n']
You could conceivably do something recursive with a regular expression like this to further split the subsections, but you are much better off just looping through the lines.
Use that generator trick by eyquem. If not you could always do a find all in global context then rewrite the stuff in a new buffer.
If its just a one off thing this Perl sample does it all...
use strict;
use warnings;
my $data = '
#
##
##
#
##
###
###
######
#####
####
#####
####
#####
######
#####
##
#
##
';
my @cnts = ();
$data =~ s/^ [^\S\n]* (\#+) [^\S\n]* (.*) $/ callback($1,$2) /xemg;
print $data;
exit(0);
##
sub callback {
my ($pounds, $text) = @_;
my $i = length($pounds) - 1;
if ($i == 0 || $i <= $#cnts) {
@cnts[ ($i+1) .. $#cnts ] = (0) x ($#cnts - $i);
++$cnts[ $i ];
}
else {
@cnts[ ($#cnts+1) .. $i ] = (1) x ($i - $#cnts);
}
my $chapter = $cnts[0];
for my $ndx (1 .. $i) {
$chapter .= ".$cnts[ $ndx]";
}
return "$pounds \t $chapter $text";
}
Output:
# 1
## 1.1
## 1.2
# 2
## 2.1
### 2.1.1
### 2.1.2
###### 2.1.2.1.1.1
##### 2.1.2.1.2
#### 2.1.2.2
##### 2.1.2.2.1
#### 2.1.2.3
##### 2.1.2.3.1
###### 2.1.2.3.1.1
##### 2.1.2.3.2
## 2.2
# 3
## 3.1
My, all the helpfull people at SO
import re
import textwrap
class DefaultList(list):
"""
List having a default value (returned on invalid offset)
>>> t = DefaultList([1,2,3], default=17)
>>> t[104]
17
"""
def __init__(self, *args, **kwargs):
self.default = kwargs.pop('default', None)
super(DefaultList,self).__init__(*args, **kwargs)
def __getitem__(self, y):
if y >= self.__len__():
return self.default
else:
return super(DefaultList,self).__getitem__(y)
class SectionNumberer(object):
"Hierarchical document numberer"
def __init__(self, LineMatcher, Numbertype_list, defaultNumbertype):
"""
@param LineMatcher: line matcher instance (recognize section headings and parse them)
@param Numbertype_list: list of Number classes (do section numbering at each level)
@param defaultNumbertype: default Number class (if too few Number classes specified)
"""
super(SectionNumberer,self).__init__()
self.match = LineMatcher
self.types = DefaultList(Numbertype_list, default=defaultNumbertype)
self.numbers = []
self.title = ''
def addSection(self, level, title):
"Add new section"
depth = len(self.numbers)
if depth < level:
for i in range(depth, level):
self.numbers.append(self.types[i](1))
else:
self.numbers = self.numbers[:level]
self.numbers[-1].inc()
self.title = title
def doLine(self, ln):
"Process section numbering on single-line string"
match = self.match(ln)
if match==False:
return ln
else:
self.addSection(*match)
return str(self)
def __call__(self, s):
"Process section numbering on multiline string"
return '\n'.join(self.doLine(ln) for ln in s.split('\n'))
def __str__(self):
"Get label for current section"
section = '.'.join(str(n) for n in self.numbers)
return "{0} {1}".format(section, self.title)
class LineMatcher(object):
"Recognize section headers and parse them"
def __init__(self, match):
super(LineMatcher,self).__init__()
self.match = re.compile(match)
def __call__(self, line):
"""
@param line: string
Expects that self.match is a valid regex expression
"""
match = re.match(self.match, line)
if match:
return len(match.group(1)), match.group(2)
else:
return False
# Recognize section headers that look like '### Section_title'
PoundLineMatcher = lambda: LineMatcher(r'([#]+) (.*)')
class Numbertype(object):
def __init__(self, startAt=0, valueType=int):
super(Numbertype,self).__init__()
self.value = valueType(startAt)
def inc(self):
self.value += 1
def __str__(self):
return str(self.value)
class Roman(int):
CODING = [
(1000, 'M'),
( 900, 'CM'), ( 500, 'D'), ( 400, 'CD'), ( 100, 'C'),
( 90, 'XC'), ( 50, 'L'), ( 40, 'XL'), ( 10, 'X'),
( 9, 'IX'), ( 5, 'V'), ( 4, 'IV'), ( 1, 'I')
]
def __add__(self, y):
return Roman(int.__add__(self, y))
def __str__(self):
value = self.__int__()
if 0 < value < 4000:
result = []
for v,s in Roman.CODING:
while v <= value:
value -= v
result.append(s)
return ''.join(result)
else:
raise ValueError("can't generate Roman numeral string for {0}".format(value))
IntNumber = Numbertype
RomanNumber = lambda x=1: Numbertype(x, Roman)
def main():
test = textwrap.dedent("""
# Section
## Subsection
## Subsection
# Section
## Subsection
### Subsubsection
### Subsubsection
# Section
## Subsection
""")
numberer = SectionNumberer(PoundLineMatcher(), [IntNumber, RomanNumber, IntNumber], IntNumber)
print numberer(test)
if __name__=="__main__":
main()
turns
# Section
## Subsection
## Subsection
# Section
## Subsection
### Subsubsection
### Subsubsection
# Section
## Subsection
into
1 Section
1.I Subsection
1.II Subsection
2 Section
2.I Subsection
2.I.1 Subsubsection
2.I.2 Subsubsection
3 Section
3.I Subsection
精彩评论