Regular expression to replace with XML node

2022-12-13 00:56 问答作者：

I'm using Python to write a regular expression for replacing parts of the string with a XML node.

The source string looks like:

Hello
REPLACE(str1) this is to replace
开发者_运维技巧REPLACE(str2) this is to replace

And the result string should be like:

Hello
<replace name="str1"> this is to replace </replace>
<replace name="str2"> this is to replace </replace>

Can anyone help me?

What makes your problem a little bit tricky is that you want to match inside of a multiline string. You need to use the re.MULTILINE flag to make that work.

Then, you need to match some groups inside your source string, and use those groups in the final output. Here is code that works to solve your problem:

import re


s_pat = "^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)

s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""


def mksub(m):
    return '<replace name="%s">%s</replace>' % m.groups()


s_output = re.sub(pat, mksub, s_input)

The only tricky part is the regular expression pattern. Let's look at it in detail.

^ matches the start of a string. With re.MULTILINE, this matches the start of a line within a multiline string; in other words, it matches right after a newline in the string.

\s* matches optional whitespace.

REPLACE matches the literal string "REPLACE".

\( matches the literal string "(".

( begins a "match group".

[^)] means "match any character but a ")".

+ means "match one or more of the preceding pattern.

) closes a "match group".

\) matches the literal string ")"

(.*) is another match group containing ".*".

$ matches the end of a string. With re.MULTILINE, this matches the end of a line within a multiline string; in other words, it matches a newline character in the string.

. matches any character, and * means to match zero or more of the preceding pattern. Thus .* matches anything, up to the end of the line.

So, our pattern has two "match groups". When you run re.sub() it will make a "match object" which will be passed to mksub(). The match object has a method, .groups(), that returns the matched substrings as a tuple, and that gets substituted in to make the replacement text.

EDIT: You actually don't need to use a replacement function. You can put the special string \1 inside the replacement text, and it will be replaced by the contents of match group 1. (Match groups count from 1; the special match group 0 corresponds the the entire string matched by the pattern.) The only tricky part of the \1 string is that \ is special in strings. In a normal string, to get a \, you need to put two backslashes in a row, like so: "\\1" But you can use a Python "raw string" to conveniently write the replacement pattern. Doing so you get this:

import re

s_pat = "^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)

s_repl = r'<replace name="\1">\2</replace>'

s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""


s_output = re.sub(pat, s_repl, s_input)

Here is an excellent tutorial on how to write regular expressions in Python.

Here is a solution using pyparsing. I know you specifically asked about a regex solution, but if your requirements change, you might find it easier to expand a pyparsing parser. Or a pyparsing prototype solution might give you a little more insight into the problem leading toward a regex or other final implementation.

src = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace
"""

from pyparsing import Suppress, Word, alphas, alphanums, restOfLine

LPAR,RPAR = map(Suppress,"()")
ident = Word(alphas, alphanums)
replExpr = "REPLACE" + LPAR + ident("name") + RPAR + restOfLine("body")
replExpr.setParseAction(
    lambda toks : '<replace name="%(name)s">%(body)s </replace>' % toks
    )

print replExpr.transformString(src)

In this case, you create the expression to be matched with pyparsing, define a parse action to do the text conversion, and then call transformString to scan through the input source to find all the matches, apply the parse action to each match, and return the resulting output. The parse action serves a similar function to mksub in @steveha's solution.

In addition to the parse action, pyparsing also supports naming individual elements of the expression - I used "name" and "body" to label the two parts of interest, which are represented in the re solution as groups 1 and 2. You can name groups in an re, the corresponding re would look like:

s_pat = "^\s*REPLACE\((?P<name>[^)]+)\)(?P<body>.*)$"

Unfortunately, to access these groups by name, you have to invoke the group() method on the re match object, you can't directly do the named string interpolation as in my lambda parse action. But this is Python, right? We can wrap that callable with a class that will give us dict-like access to the groups by name:

class CallableDict(object):
    def __init__(self,fn):
        self.fn = fn
    def __getitem__(self,name):
        return self.fn(name)

def mksub(m):    
    return '<replace name="%(name)s">%(body)s</replace>' %  CallableDict(m.group)

s_output = re.sub(pat, mksub, s_input)

Using CallableDict, the string interpolation in mksub can now call m.group for each field, by making it look like we are retrieving the ['name'] and ['body'] elements of a dict.

Maybe like this ?

import re

mystr = """Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""

prog = re.compile(r'REPLACE\((.*?)\)\s(.*)')

for line in mystr.split("\n"):
    print prog.sub(r'< replace name="\1" > \2',line)

Something like this should work:

import re,sys

f = open( sys.argv[1], 'r' )
for i in f:
    g = re.match( r'REPLACE\((.*)\)(.*)', i )
    if g is None:
        print i
    else:
        print '<replace name=\"%s\">%s</replace>' % (g.group(1),g.group(2))
f.close()

import re

a="""Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""

regex = re.compile(r"^REPLACE\(([^)]+)\)\s+(.*)$", re.MULTILINE)

b=re.sub(regex, r'< replace name="\1" > \2 < /replace >', a)

print b

will do the replace in one line.

继续阅读：python regex

Regular expression to replace with XML node

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？