
Python regex: catching two kinds of comments

Example:

a = "bzzzzzz <!-- blabla --> blibli * bloblo * blublu"

I want to capture the first comment. A comment can be either

(<!-- .* -->) or (\* .* \*)

This works:

re.search("<!--(?P<comment> .* )-->",a).group(1)

So does this:

re.search("\*(?P<comment> .* )\*",a).group(1)

But I want either kind to end up in the comment group, so I tried something like:

re.search("(<!--(?P<comment> .* )-->|\*(?P<comment> .* )\*)",a).group(1)

But it doesn't work.

Thanks


Try conditional expression:

>>> for m in re.finditer(r"(?:(<!--)|(\*))(?P<comment> .*? )(?(1)-->)(?(2)\*)", a):
...   print(m.group('comment'))
...
 blabla
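The same conditional pattern can be wrapped in a small Python 3 helper. This is only a minimal sketch: the names COMMENT_RE and find_comments are made up for illustration, and it assumes a comment does not span several lines.

```python
import re

# Group 1 is set when the opener is "<!--", group 2 when it is "*".
# The conditionals (?(1)-->) and (?(2)\*) then require the matching closer.
COMMENT_RE = re.compile(r"(?:(<!--)|(\*))(?P<comment> .*? )(?(1)-->)(?(2)\*)")

def find_comments(text):
    """Return the text of every comment, surrounding spaces included."""
    return [m.group("comment") for m in COMMENT_RE.finditer(text)]

a = "bzzzzzz <!-- blabla --> blibli * bloblo * blublu"
print(find_comments(a))  # [' blabla ', ' bloblo ']
```

Because the closer is chosen by the conditional, an opener of one kind cannot be paired with a closer of the other kind.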
 bloblo


The exception you get in the "doesn't work" part is quite explicit about what is wrong:

sre_constants.error: redefinition of group name 'comment' as group 3; was group 2

Both groups have the same name. Just rename the second one:

>>> re.search(r"(<!--(?P<comment> .* )-->|\*(?P<comment2> .* )\*)",a).group(1)
'<!-- blabla -->'
>>> re.search(r"(<!--(?P<comment> .* )-->|\*(?P<comment2> .* )\*)",a).groups()
('<!-- blabla -->', ' blabla ', None)
>>> re.findall(r"(<!--(?P<comment> .* )-->|\*(?P<comment2> .* )\*)",a)
[('<!-- blabla -->', ' blabla ', ''), ('* bloblo *', '', ' bloblo ')]
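With two differently named groups, the caller still has to pick whichever one took part in the match. A possible way to do that is sketched below; first_comment is a hypothetical helper name, and the pattern uses non-greedy .*? instead of the greedy .* above, which is my substitution so that several comments on one line stay separate.

```python
import re

# Two alternatives, each with its own group name.
pattern = re.compile(r"<!--(?P<comment> .*? )-->|\*(?P<comment2> .*? )\*")

def first_comment(text):
    """Return the text of the first comment of either kind, or None."""
    m = pattern.search(text)
    if m is None:
        return None
    # Exactly one of the two named groups took part in the match;
    # the other one is None.
    html, star = m.group("comment"), m.group("comment2")
    return html if html is not None else star

a = "bzzzzzz <!-- blabla --> blibli * bloblo * blublu"
print(repr(first_comment(a)))
```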


As Gurney pointed out, you have two captures with the same name. Since you're not actually using the name, just leave that out.

Also, the r"" raw string notation is a good habit.

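Oh, and a third thing: you're grabbing the wrong index. 0 is the whole match, 1 is the whole "either-or" block, 2 is the inner capture of the <!-- --> alternative, and 3 is the inner capture of the * * alternative; whichever alternative did not match is None. The first comment in your example is the <!-- --> kind, so group 2 holds it: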

re.search(r"(<!--( .* )-->|\*( .* )\*)",a).group(2)
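To see how the group numbers line up, here is a short sketch that runs the same pattern against one string of each kind. The two sample strings are invented for illustration.

```python
import re

pattern = re.compile(r"(<!--( .* )-->|\*( .* )\*)")

html = pattern.search("x <!-- blabla --> y")
star = pattern.search("x * bloblo * y")

# groups() lists groups 1, 2 and 3 in order; the inner group of the
# alternative that did not match is None.
print(html.groups())
print(star.groups())

# Picking the inner capture regardless of which alternative matched:
inner = [m.group(2) if m.group(2) is not None else m.group(3)
         for m in (html, star)]
```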


re.findall might be a better fit for this:

import re

# Keep your regex simple. You'll thank yourself a year from now. Note that
# this doesn't include the surrounding spaces. It also uses non-greedy matching
# so that you can embed multiple comments on the same line, and it doesn't
# break on strings like '<!-- first comment --> fragment -->'.
pattern = re.compile(r"(?:<!-- (.*?) -->|\* (.*?) \*)")

inputstring = 'bzzzzzz <!-- blabla --> blibli * bloblo * blublu foo ' \
              '<!-- another comment --> goes here'

# Now use re.findall to search the string. Each match will return a tuple
# with two elements: one for each of the groups in the regex above. Pick the
# non-blank one. This works even when both groups are empty; you just get an
# empty string.
results = [first or second for first, second in pattern.findall(inputstring)]
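The "both groups are empty" case mentioned in the comments can be checked with a quick sketch. The input string here is made up, and includes an empty <!--  --> comment (two spaces between the delimiters).

```python
import re

pattern = re.compile(r"(?:<!-- (.*?) -->|\* (.*?) \*)")

text = "a <!-- one --> b * two * c <!--  --> d"

# findall returns one (first, second) tuple per match; the group that did
# not take part is an empty string, so "or" picks the non-blank one.
results = [first or second for first, second in pattern.findall(text)]
print(results)  # ['one', 'two', '']
```

The empty comment yields an empty string rather than being skipped, so filter the list afterwards if blank comments are not wanted.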


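You could go one of two ways:

1: Branch reset (?|pattern|pattern|...)
(?|<!--( .*? )-->|\*( .*? )\*) capture group 1 always contains the comment text.
Python's built-in re module does not support branch reset; the third-party regex module does.

2: Conditional expression (?(condition)yes-pattern|no-pattern)
(?:(<!--)|\*)(?P<comment> .*? )(?(1)-->|\*) here the condition is whether group 1 captured anything. Python's re module supports this.

Modifiers: s (single line) and g (global). In Python these correspond to the re.DOTALL flag and to re.findall or re.finditer.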
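In Python, the conditional form with a yes/no branch might look like the sketch below. The multi-line input string is invented for illustration; re.DOTALL stands in for the s modifier and re.finditer for the g modifier.

```python
import re

# (?(1)-->|\*) means: if group 1 ("<!--") took part in the match,
# require "-->", otherwise require "*".
pattern = re.compile(r"(?:(<!--)|\*)(?P<comment> .*? )(?(1)-->|\*)", re.DOTALL)

text = "x <!-- spans\ntwo lines --> y * bloblo * z"

# With re.DOTALL, "." also matches the newline inside the first comment.
comments = [m.group("comment") for m in pattern.finditer(text)]
print(comments)
```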
