Implement lookahead iterator for strings in Python
I'm doing some parsing that requires one token of lookahead. What I'd like is a fast function (or class?) that would take an iterator and turn it into a list of tuples in the form (token, lookahead), such that:
>>> a = ['a', 'b', 'c', 'd']
>>> list(lookahead(a))
[('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', None)]
basically, this would be handy for looking ahead in iterators like this:
for (token, lookahead_1) in lookahead(a):
    pass
Though, I'm not sure if there's a name for this technique or function in itertools that already will do this. Any ideas?
Thanks!
There are easier ways if you are just using lists; see Sven's answer. Here is one way to do it for general iterators:
>>> from itertools import tee, izip_longest
>>> a = ['a', 'b', 'c', 'd']
>>> it1, it2 = tee(iter(a))
>>> next(it2) # discard this first value
'a'
>>> [(x,y) for x,y in izip_longest(it1, it2)]
# or just list(izip_longest(it1, it2))
[('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', None)]
Here's how to use it in a for loop like in your question.
>>> it1,it2 = tee(iter(a))
>>> next(it2)
'a'
>>> for (token, lookahead_1) in izip_longest(it1,it2):
...     print token, lookahead_1
...
a b
b c
c d
d None
Finally, here's the function you are looking for
>>> def lookahead(it):
...     it1, it2 = tee(iter(it))
...     next(it2, None)  # skip the first value; the default avoids StopIteration on empty input
...     return izip_longest(it1, it2)
...
>>> for (token, lookahead_1) in lookahead(a):
...     print token, lookahead_1
...
a b
b c
c d
d None
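On Python 3, izip_longest is named zip_longest and print is a function. The following is a minimal sketch of the same tee-based approach ported to Python 3, assuming that is the version you are targeting. The None default passed to next() is an addition here, so that an empty input produces no pairs instead of raising StopIteration:

```python
from itertools import tee, zip_longest


def lookahead(iterable):
    """Return an iterator of (item, next_item) pairs; the last pair ends with None."""
    it1, it2 = tee(iter(iterable))
    # Advance the second copy by one element. The default value means an
    # empty input is tolerated instead of raising StopIteration.
    next(it2, None)
    return zip_longest(it1, it2)


for token, lookahead_1 in lookahead(['a', 'b', 'c', 'd']):
    print(token, lookahead_1)
```

Note that tee buffers items that one copy has consumed and the other has not; here the two copies are never more than one element apart, so the buffer stays small.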
I like both Sven's and gnibbler's answers, but for some reason, it pleases me to roll my own generator.
def lookahead(iterable, null_item=None):
    iterator = iter(iterable)  # in case a list is passed
    try:
        prev = next(iterator)
    except StopIteration:
        return  # empty input: yield nothing
    for item in iterator:
        yield prev, item
        prev = item
    yield prev, null_item
Tested:
>>> for i in lookahead(x for x in []):
...     print i
...
>>> for i in lookahead(x for x in [0]):
...     print i
...
(0, None)
>>> for i in lookahead(x for x in [0, 1, 2]):
...     print i
...
(0, 1)
(1, 2)
(2, None)
Edit: Karl and ninjagecko raise an excellent point: the sequence passed in may contain None, and so using None as the final lookahead value may lead to ambiguity. But there's no obvious alternative; a module-level constant is possibly the best approach in many cases, but may be overkill for a one-off function like this, not to mention the fact that bool(object()) == True, which could lead to unexpected behavior. Instead, I've added a null_item parameter with a default of None; that way users can pass in whatever makes sense for their needs, be it a simple object() sentinel, a constant of their own creation, or even a class instance with special behavior. Since most of the time None is the obvious and even possibly the expected behavior, I've left None as the default.
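As a rough illustration of the null_item idea, here is a sketch in Python 3 syntax that passes a private object() sentinel and tests for it by identity. The name END is made up for this example, and the generator is repeated so the snippet runs on its own:

```python
def lookahead(iterable, null_item=None):
    """Yield (item, next_item) pairs; the final pair uses null_item."""
    iterator = iter(iterable)
    try:
        prev = next(iterator)
    except StopIteration:
        return  # empty input: yield nothing
    for item in iterator:
        yield prev, item
        prev = item
    yield prev, null_item


END = object()  # hypothetical sentinel, distinct from any real item

pairs = list(lookahead([None, None], null_item=END))

for token, ahead in pairs:
    # Compare by identity. A truthiness test would not work, because
    # bool(object()) is True while a real None item is falsy.
    if ahead is END:
        print(token, "<end of input>")
    else:
        print(token, ahead)
```

With the sentinel, a None that is really in the data (the first pair) stays distinguishable from the end marker (the second pair).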
The usual way to do this for a list a is
from itertools import izip_longest
for token, lookahead in izip_longest(a, a[1:]):
    pass
For the last token, you will get None as look-ahead token.
If you want to avoid the copy of the list introduced by a[1:], you can use islice(a, 1, None) instead. For a slight modification working for arbitrary iterables, see the answer by gnibbler. For a simple, easy to grasp generator function also working for arbitrary iterables, see the answer by senderle.
You might find the answer to your question here: Using lookahead with generators.
I consider all these answers incorrect, because they will cause unforeseen bugs if your list contains None. Here is my take:
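SEQUENCE_END = object()  # unique sentinel; test for it with `is`

def lookahead(iterable):
    iterator = iter(iterable)  # naming this `iter` would shadow the builtin and raise UnboundLocalError
    try:
        current = next(iterator)
    except StopIteration:
        return  # empty input: yield nothing
    for ahead in iterator:
        yield current, ahead
        current = ahead
    yield current, SEQUENCE_END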
Example:
>>> for x, ahead in lookahead(range(3)):
...     print(x, ahead)
...
0 1
1 2
2 <object object at 0x...>
Example of how this answer is better:
def containsDoubleElements(seq):
    """
    Returns whether seq contains double elements, e.g. [1,2,2,3]
    """
    return any(val==nextVal for val,nextVal in lookahead(seq))
>>> containsDoubleElements([None])
False # correct!
def containsDoubleElements_BAD(seq):
    """
    Returns whether seq contains double elements, e.g. [1,2,2,3]
    """
    return any(val==nextVal for val,nextVal in lookahead_OTHERANSWERS(seq))
>>> containsDoubleElements_BAD([None])
True # incorrect!
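To make the comparison reproducible in one place, here is a self-contained Python 3 sketch. The end parameter and the snake_case names are additions for this illustration only: passing end=None imitates the None-padded versions from the other answers, while the default uses the sentinel.

```python
SEQUENCE_END = object()  # unique end marker


def lookahead(iterable, end=SEQUENCE_END):
    """Yield (item, next_item) pairs; the final pair uses `end` (hypothetical parameter)."""
    iterator = iter(iterable)
    try:
        current = next(iterator)
    except StopIteration:
        return  # empty input: yield nothing
    for ahead in iterator:
        yield current, ahead
        current = ahead
    yield current, end


def contains_double_elements(seq, end=SEQUENCE_END):
    """Return whether seq contains two equal adjacent elements, e.g. [1, 2, 2, 3]."""
    return any(val == next_val for val, next_val in lookahead(seq, end))


print(contains_double_elements([None]))            # False: the sentinel never equals None
print(contains_double_elements([None], end=None))  # True: the None padding looks like a repeat
print(contains_double_elements([1, 2, 2, 3]))      # True: a genuine repeat
```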