Two very close regexes with lookahead assertions in Python - why does re.split() behave differently?
I was trying to answer this question where the OP has the following string:
"path:bte00250 Alanine, aspartate and glutamate metabolism path:bte00330 Arginine and proline metabolism"
and wants to split it to obtain the following list:
['path:bte00250 Alanine, aspartate and glutamate metabolism', 'path:bte00330 Arginine and proline metabolism']
I tried to solve it by using a simple lookahead assertion in a regex, (?=path:). Well, it did not work:
>>> import re
>>> s = "path:bte00250 Alanine, aspartate and glutamate metabolism path:bte00330 Arginine and proline metabolism"
>>> r = re.compile('(?=path:)')
>>> r.split(s)
['path:bte00250 Alanine, aspartate and glutamate metabolism path:bte00330 Arginine and proline metabolism']
However, in this answer, the answerer got it working by preceding the lookahead assertion with a whitespace:
>>> line = 'path:bte00250 Alanine, aspartate and glutamate metabolism path:bte00330 Arginine and proline metabolism'
>>> re.split(' (?=path:)', line)
['path:bte00250 Alanine, aspartate and glutamate metabolism', 'path:bte00330 Arginine and proline metabolism']
Why did the regex work with the whitespace? Why did it not work without the whitespace?
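Python's re.split() has a documented limitation: before Python 3.7, it can't split on zero-length matches. A bare (?=path:) consumes no characters, so every match it finds is empty and gets ignored. Adding the space gives the pattern one character to consume, so the match is no longer empty and the split works. Since Python 3.7, re.split() does split on empty matches, so the bare lookahead splits too, though it leaves an empty first element and keeps the separating space on the first item.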
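If you want something that behaves the same regardless of Python version, one option is to make sure the pattern always consumes at least one character. Here is a minimal sketch using the string from the question; the variable names parts and records are just illustrative, and the findall() variant is an alternative I'm suggesting, not something from the linked answer:

```python
import re

s = ("path:bte00250 Alanine, aspartate and glutamate metabolism "
     "path:bte00330 Arginine and proline metabolism")

# Consume the whitespace in front of each "path:" so the match is never
# empty. The lookahead keeps "path:" itself out of the separator.
parts = re.split(r'\s+(?=path:)', s)

# Alternative that avoids split() entirely: match each record lazily, up to
# the whitespace before the next "path:" or the end of the string.
records = re.findall(r'path:.*?(?=\s+path:|$)', s)

print(parts)
print(records)
```

Both calls should give the two-item list the OP asked for, because neither relies on how re.split() treats empty matches.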