Replace character in HTML tag with REGEX [duplicate]

This question already has answers here: Closed 11 years ago.

Possible Duplicate:

Replace all < and > that are NOT part of an HTML tag

  1. Using Python
  2. I know how much everyone here hates regex questions about HTML tags, but I am just doing this as an exercise to help me learn regex.

Replace (1 can be any character):

<b>< </b>
<b> < </b>
<b> <</b>
<b><</b>
<b><111</b>
<b>11<11</b>
<b>111<</b>
<b>11<11</b>

<b>
<<<
</b>

With:

<b>& </b>
<b> & </b>
<b> &</b>
<b>&</b>
<b>&111</b>
<b>11&11</b>
<b>111&</b>
<b>11&11</b>

<b>
&
</b>

I have searched the web and tried many of my own solutions. Is this possible? And if so, how?

My best guess was something like:

re.sub(r'(?<=>)(.*?)<(.*?)(?=</)', r'\1&lt;\2', string)

But that falls apart with re.DOTALL and with runs such as '<<<'.
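
To see where the attempt breaks, it can help to run it on a few of the sample strings. This is only a minimal sketch: the helper name `attempt` is made up for illustration, and the pattern is the one from the question, used without re.DOTALL.

```python
import re

# The pattern from the question, unchanged.
ATTEMPT = re.compile(r'(?<=>)(.*?)<(.*?)(?=</)')

def attempt(text):
    return ATTEMPT.sub(r'\1&lt;\2', text)

print(attempt('<b>11<11</b>'))     # <b>11&lt;11</b>  (a single stray '<' is handled)
print(attempt('<b><<<</b>'))       # <b>&lt;<<</b>    (only the first '<' of the run is replaced)
print(attempt('<b>\n<<<\n</b>'))   # unchanged: '.' does not cross the newline after <b>
```

The pattern has exactly one literal `<` in it, so a run of several `<` characters is never collapsed into a single replacement.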


I sincerely hope this is never used on actual HTML, but here is a solution that works for your example data. Note that it replaces with &lt; like your sample code, not & like in your sample data.

re.sub(r'<+([^<>]*?)(?=</)', r'&lt;\1', your_string)
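
One way to check this pattern is to run it over the sample data from the question, with `&lt;` in place of `&` in the expected output. The names `escape_lt` and `SAMPLES` below are illustrative only and do not come from the answer.

```python
import re

def escape_lt(text):
    # <+        a run of one or more '<'
    # [^<>]*?   followed by text that contains no angle brackets
    # (?=</)    up to, but not including, the start of a closing tag
    return re.sub(r'<+([^<>]*?)(?=</)', r'&lt;\1', text)

# Sample inputs from the question, paired with the expected result.
SAMPLES = {
    '<b>< </b>': '<b>&lt; </b>',
    '<b> <</b>': '<b> &lt;</b>',
    '<b><</b>': '<b>&lt;</b>',
    '<b>11<11</b>': '<b>11&lt;11</b>',
    '<b>111<</b>': '<b>111&lt;</b>',
    '<b>\n<<<\n</b>': '<b>\n&lt;\n</b>',
}

for before, expected in SAMPLES.items():
    print(repr(before), '->', repr(escape_lt(before)))
```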


You could use something like this:

re.sub(r'(?:<(?!/?b>))+', '&', string)

And if you'd want it to work with (some) other tags, you could use something like this:

re.sub(r'(?:<(?!/?\w+[^<>]*>))+', '&', string)
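
The difference between the two patterns is easiest to see side by side. The sketch below wraps each in a function (the names `replace_b_only` and `replace_any_tag` are invented here) and also shows one input where the broader pattern is fooled, which is presumably why it only works with "(some)" other tags.

```python
import re

def replace_b_only(text):
    # Treats only <b> and </b> as tags; every other run of '<' becomes '&'.
    return re.sub(r'(?:<(?!/?b>))+', '&', text)

def replace_any_tag(text):
    # Treats anything shaped like <word ...> or </word ...> as a tag.
    return re.sub(r'(?:<(?!/?\w+[^<>]*>))+', '&', text)

print(replace_b_only('<b>11<11</b>'))             # <b>11&11</b>
print(replace_b_only('<i>1 < 2</i>'))             # &i>1 & 2&/i>  (the <i> tags are mangled)
print(replace_any_tag('<a href="x">1 < 2</a>'))   # <a href="x">1 & 2</a>
print(replace_any_tag('<b>1<2 and 3>2</b>'))      # unchanged: '<2 and 3>' looks like a tag
```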


If a is your string, this seems to work:

re.sub('<+([^b/])','&\\1',a)

And a second, more generic version:

re.sub('(<[^<>]+>)([^<>]*)<+([^<>]*)(<[^<>]+>)','\\1\\2&\\3\\4',a)
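
The first version relies on the character after the `<` not being `b` or `/`, so it is worth trying an input where the element text starts with `b`. The function names `short_version` and `generic_version` are made up for this sketch; the patterns are the two from this answer.

```python
import re

def short_version(text):
    # A run of '<' counts as stray only if the next character is not 'b' or '/'.
    return re.sub('<+([^b/])', '&\\1', text)

def generic_version(text):
    # Opening tag, text, run of '<', text, closing tag: rebuilt with '&' in the middle.
    return re.sub('(<[^<>]+>)([^<>]*)<+([^<>]*)(<[^<>]+>)', '\\1\\2&\\3\\4', text)

print(short_version('<b>1<2</b>'))      # <b>1&2</b>
print(short_version('<b><bar</b>'))     # unchanged: the stray '<' is followed by 'b'
print(generic_version('<b><bar</b>'))   # <b>&bar</b>
```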


This tested regex works for your given test data:

reobj = re.compile(r"""
    # Match left angle brackets not part of HTML tag.
    <+               # One or more < but only if
    (?=[^<>]*</\w+)  # inside HTML element contents.
    """, re.VERBOSE)
result = reobj.sub("&", subject)
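
As a rough check beyond the given test data, the same pattern can be tried on a line with several elements and on nested markup. The names `STRAY_LT` and `replace_stray_lt` are assumptions made for this sketch, and the last input shows a case the lookahead does not cover.

```python
import re

STRAY_LT = re.compile(r"""
    <+               # one or more '<' ...
    (?=[^<>]*</\w+)  # ... followed by bracket-free text and then a closing tag
    """, re.VERBOSE)

def replace_stray_lt(subject):
    return STRAY_LT.sub("&", subject)

print(replace_stray_lt('<p>a << b</p><p>c < d</p>'))  # <p>a & b</p><p>c & d</p>
print(replace_stray_lt('<b>\n<<<\n</b>'))             # the '<<<' run collapses to one '&'
print(replace_stray_lt('<p>1 < 2 <b>x</b></p>'))      # unchanged: an opening tag comes first
```

Because the lookahead requires a closing tag right after the bracket-free text, a stray `<` that sits before a nested opening tag is left alone.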