Why is the following negative lookahead not working?

import re
txt =  'harry potter is awsome  so is harry james potter'
pat = '\W+(?!potter)'
re.findall(pat,txt)

According to my understanding, the output should have been all the words that are not followed by potter, that is:

['potter', 'is', 'awsome', 'so', 'is', 'harry', 'james', 'potter']

but the actual output is

['harry', 'potter', 'is', 'awsome', 'so', 'is', 'harry', 'james', 'potter']

Why is the pattern also matching the harry that is followed by potter?


Because the text right after "harry" is " potte" (note the leading space), and that doesn't match "potter".

>>> txt = 'harry potter is awsome  so is harry james potter'
>>> pat = r'(\w+)(?:\W|\Z)(?!potter)'
>>> re.findall(pat,txt)
['potter', 'is', 'awsome', 'so', 'is', 'harry', 'potter']


"According to my understanding, the output should have been all the words that are not followed by potter"

It does. The thing is, no word is immediately followed by "potter", because every word, by definition, is followed by either a non-word character (here, whitespace) or the end of the string.
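
To make that concrete, here is a small sketch that prints the six characters sitting right after each word, which is the text a lookahead placed directly after \w+ would test. The name following and the six-character window (the length of "potter") are just choices made for this illustration.

```python
import re

txt = 'harry potter is awsome  so is harry james potter'

# Collect the 6 characters that come right after each word. A lookahead
# written directly after \w+ is tested against exactly this text.
following = [txt[m.end():m.end() + 6] for m in re.finditer(r'\w+', txt)]

for m, upcoming in zip(re.finditer(r'\w+', txt), following):
    print(repr(m.group()), 'is followed by', repr(upcoming))

# The first line printed is: 'harry' is followed by ' potte'
```

None of the printed snippets is exactly 'potter', since each one starts with a space or is empty at the end of the string, so (?!potter) never rejects a whole word.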


I get this result:

[' ', ' ', '  ', ' ', ' ', ' ']

...which is exactly what I expect. \W+ (note the uppercase W) matches one or more non-word characters, so \W+(?!potter) matches the whitespace between the words in your input, except when the upcoming word starts with "potter". If I wanted to match each word that's not followed by the word "potter" I would use this regex:

pat = r'\b\w+\b(?!\W+potter\b)'

\b matches a word boundary; the first two ensure that I'm matching a whole word, and the last one makes sure the upcoming word is "potter" and not a longer word that starts with "potter".
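
As a quick check, the sketch below runs that pattern on the input from the question, then compares it with a variant that drops the final \b. The string 'harry potters' and the names sample, with_boundary and without_boundary are made up for this illustration.

```python
import re

txt = 'harry potter is awsome  so is harry james potter'
pat = r'\b\w+\b(?!\W+potter\b)'

result = re.findall(pat, txt)
print(result)
# ['potter', 'is', 'awsome', 'so', 'is', 'harry', 'potter']

# Hypothetical input where the next word only starts with "potter".
sample = 'harry potters'
with_boundary = re.findall(r'\b\w+\b(?!\W+potter\b)', sample)
without_boundary = re.findall(r'\b\w+\b(?!\W+potter)', sample)
print(with_boundary)     # ['harry', 'potters']
print(without_boundary)  # ['potters']
```

Without the trailing \b, "harry" is rejected even though the next word is "potters" rather than "potter".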

Notice how I used a raw string (r'...'). You should get in the habit of using them for all your regexes in Python. In this case, \b would be interpreted as a backspace character if I had used a normal string.
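
Here is a minimal sketch of the difference, assuming the same input string as in the question. The names normal and raw are only for illustration.

```python
import re

txt = 'harry potter is awsome  so is harry james potter'

normal = '\bpotter\b'   # Python turns each \b into the backspace character \x08
raw = r'\bpotter\b'     # backslash and b reach the regex engine untouched

print(len('\b'), len(r'\b'))    # 1 2
print(re.findall(normal, txt))  # []
print(re.findall(raw, txt))     # ['potter', 'potter']
```

The normal string compiles without complaint, but it looks for literal backspace characters around "potter" and so finds nothing.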


import re

txt = 'harry potter is awsome  so is harry james potter'

# \w+\b matches a whole word; the lookahead rejects it when spaces and then "potter" follow
pat = r'\w+\b(?![\ ]+potter)'

print(re.findall(pat, txt))