Adding anchors to h2 in text using python and regexp
I'm trying to add anchors to all the h2 elements in my HTML, using Python. This code adds the anchors, but I also need to fill in the name of each anchor.
Any idea if the name can be the number of the match in the loop or a slugified version of the text between the h2 tags?
Here's the code so far:
regex = '(?P<name><h2>.*?</h2>)'
text = re.sub(regex, "<a name=''/>"+r"\g<name>", text)
You can take advantage of the fact that the second argument to re.sub can be a function to do pretty much anything you'd like. Here's an example that will slugify the text inside the <h2> element:
import re

regex = r'(?P<name><h2>(.*?)</h2>)'  # Note the extra group inside the <h2>
def slugify(s):
    return s.replace(' ', '-')  # bare-bones slugify
def anchorize(matchobj):
    # group(2) is the heading text, group(1) is the whole <h2> element
    return '<a name="%s"/>%s' % (slugify(matchobj.group(2)), matchobj.group(1))
text = re.sub(regex, anchorize, text)
(That slugify function could obviously use some work.)
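As one possible way to flesh out that slugify function, here is a minimal sketch. It assumes Python 3 and that lowercase ASCII slugs separated by hyphens are acceptable; the exact rules (which characters to keep, how to treat inline tags) are assumptions for illustration, not something the question specifies.

```python
import re
import unicodedata

def slugify(s):
    # Fold accented characters to their closest ASCII equivalents.
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    # Drop any inline tags that might appear inside the heading.
    s = re.sub(r"<[^>]+>", "", s)
    # Collapse every run of non-alphanumeric characters into one hyphen.
    s = re.sub(r"[^a-z0-9]+", "-", s.lower())
    # Trim hyphens left over at either end.
    return s.strip("-")
```

This version can be dropped in place of the bare-bones slugify, since anchorize only needs a function that maps a string to a string. Note that two headings with the same text would still produce the same anchor name.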
You could also implement a counter with a version of anchorize that used a global counter or, better yet, a class that kept track of its own counter and implemented the special __call__ method.
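The counter idea could look roughly like the sketch below. The class name CountingAnchorizer, the "section-" prefix, and the sample text are all made up for illustration; only re.sub accepting a callable as its replacement argument comes from the re module itself.

```python
import re

class CountingAnchorizer(object):
    """Replacement callable that numbers each anchor it emits."""

    def __init__(self, prefix="section-"):
        self.prefix = prefix  # hypothetical naming scheme for the anchors
        self.count = 0

    def __call__(self, matchobj):
        # re.sub calls this once per match, in order of appearance.
        self.count += 1
        return '<a name="%s%d"/>%s' % (self.prefix, self.count, matchobj.group(0))

regex = r'<h2>.*?</h2>'
text = "<h2>Intro</h2><p>Hi</p><h2>Usage</h2>"
result = re.sub(regex, CountingAnchorizer(), text)
```

Because the count lives on the instance, creating a fresh CountingAnchorizer for each document restarts the numbering at 1 without touching any global state.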
Not sure if I understand correctly, but is placing the author as the name attribute sufficient? Maybe you could use (as long as the author name doesn't contain invalid chars for an attribute):
regex = r'(?P<name><h2>(.*?)</h2>)'
print(re.sub(regex, r"<a name='\g<2>'/>\g<name>", text))
If you need a more advanced substitution method, such as parsing the author name or looking up some sort of related id, you could define a replacement function (see the re.sub documentation):
def name_substitution(matchobj):
    name = matchobj.group(2)
    # do some processing on name here ...
    name = name.replace(' ', '_')
    return "<a name='%s'>%s</a>" % (name, matchobj.group(0))

print(re.sub(regex, name_substitution, text))