开发者

Python Regex Question

I have an end tag followed by a carriage return line fee开发者_运维技巧d (x0Dx0A) followd by one or more tabs (x09) followed by a new start tag .

Something like this:

</tag1>x0Dx0Ax09x09x09<tag2> or </tag1>x0Dx0Ax09x09x09x09x09<tag2>

What Python regex should I use to replace it with something like this:

</tag1><tag3>content</tag3><tag2>

Thanks in advance.


Here is code for something like what you say that you need:

>>> import re
>>> sample = '</tag1>\r\n\t\t\t\t<tag2>'
>>> sample
'</tag1>\r\n\t\t\t\t<tag2>'
>>> pattern = '(</tag1>)\r\n\t+(<tag2>)'
>>> replacement = r'\1<tag3>content</tag3>\2'
>>> re.sub(pattern, replacement, sample)
'</tag1><tag3>content</tag3><tag2>'
>>>

Note that \r\n\t+ may be a bit too specific, especially if production of your input is not under your control. It may be better to adopt the much more general \s* (zero or more whitespace characters).

Using regexes to parse XML and HTML is not a good idea in general ... while it's hard to see a failure mode here (apart from elementary errors in getting the pattern correct), you might like to tell us what the underlying problem is, in case some other solution is better.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜