Python Regex Question
I have an end tag followed by a carriage return line fee开发者_运维技巧d (x0Dx0A) followd by one or more tabs (x09) followed by a new start tag .
Something like this:
</tag1>x0Dx0Ax09x09x09<tag2> or </tag1>x0Dx0Ax09x09x09x09x09<tag2>
What Python regex should I use to replace it with something like this:
</tag1><tag3>content</tag3><tag2>
Thanks in advance.
Here is code for something like what you say that you need:
>>> import re
>>> sample = '</tag1>\r\n\t\t\t\t<tag2>'
>>> sample
'</tag1>\r\n\t\t\t\t<tag2>'
>>> pattern = '(</tag1>)\r\n\t+(<tag2>)'
>>> replacement = r'\1<tag3>content</tag3>\2'
>>> re.sub(pattern, replacement, sample)
'</tag1><tag3>content</tag3><tag2>'
>>>
Note that \r\n\t+
may be a bit too specific, especially if production of your input is not under your control. It may be better to adopt the much more general \s*
(zero or more whitespace characters).
Using regexes to parse XML and HTML is not a good idea in general ... while it's hard to see a failure mode here (apart from elementary errors in getting the pattern correct), you might like to tell us what the underlying problem is, in case some other solution is better.
精彩评论