
Implement lookahead iterator for strings in Python

I'm doing some parsing that requires one token of lookahead. What I'd like is a fast function (or class?) that would take an iterator and turn it into a list of tuples in the form (token, lookahead), such that:

>>> a = ['a', 'b', 'c', 'd']
>>> list(lookahead(a))
[('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', None)]

Basically, this would be handy for looking ahead in iterators like this:

for (token, lookahead_1) in lookahead(a):
  pass

I'm not sure whether this technique has a name, or whether itertools already has a function that does this. Any ideas?

Thanks!


There are easier ways if you are just using lists; see Sven's answer. Here is one way to do it for general iterators:

>>> from itertools import tee, izip_longest
>>> a = ['a', 'b', 'c', 'd']
>>> it1, it2 = tee(iter(a))
>>> next(it2)  # discard this first value
'a'
>>> [(x,y) for x,y in izip_longest(it1, it2)]
    # or just list(izip_longest(it1, it2))
[('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', None)]

Here's how to use it in a for loop like in your question.

>>> it1,it2 = tee(iter(a))
>>> next(it2)
'a'
>>> for (token, lookahead_1) in izip_longest(it1,it2):
...     print token, lookahead_1
... 
a b
b c
c d
d None

Finally, here's the function you are looking for

>>> def lookahead(it):
...     it1, it2 = tee(iter(it))
...     next(it2, None)  # skip the first value; the default avoids StopIteration on empty input
...     return izip_longest(it1, it2)
... 
>>> for (token, lookahead_1) in lookahead(a):
...     print token, lookahead_1
... 
a b
b c
c d
d None
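
For readers on Python 3, here is a minimal sketch of the same tee-based approach. It assumes Python 3, where izip_longest is named zip_longest and print is a function. The default passed to next() is an addition so that an empty input yields nothing instead of raising StopIteration.

```python
from itertools import tee, zip_longest

def lookahead(iterable):
    """Return an iterator of (item, next_item) pairs; the last pair has None as lookahead."""
    it1, it2 = tee(iterable)
    # Advance the second copy by one item; the default keeps an empty input from raising.
    next(it2, None)
    return zip_longest(it1, it2)

for token, lookahead_1 in lookahead(iter("abcd")):
    print(token, lookahead_1)
```

Because tee buffers only the items that one copy has seen and the other has not, this keeps at most one item in memory beyond the current one.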


I like both Sven's and gnibbler's answers, but for some reason, it pleases me to roll my own generator.

def lookahead(iterable, null_item=None):
    iterator = iter(iterable)  # in case a list is passed
    try:
        prev = next(iterator)  # works on Python 2.6+ and Python 3
    except StopIteration:
        return  # empty input: yield nothing
    for item in iterator:
        yield prev, item
        prev = item
    yield prev, null_item

Tested:

>>> for i in lookahead(x for x in []):
...     print i
... 
>>> for i in lookahead(x for x in [0]):
...     print i
... 
(0, None)
>>> for i in lookahead(x for x in [0, 1, 2]):
...     print i
... 
(0, 1)
(1, 2)
(2, None)

Edit: Karl and ninjagecko raise an excellent point -- the sequence passed in may contain None, and so using None as the final lookahead value may lead to ambiguity. But there's no obvious alternative; a module-level constant is possibly the best approach in many cases, but may be overkill for a one-off function like this -- not to mention the fact that bool(object()) == True, which could lead to unexpected behavior. Instead, I've added a null_item parameter with a default of None -- that way users can pass in whatever makes sense for their needs, be it a simple object() sentinel, a constant of their own creation, or even a class instance with special behavior. Since most of the time None is the obvious and even possibly the expected behavior, I've left None as the default.
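
As a rough sketch of how the null_item parameter might be used, the snippet below passes a private object() sentinel and checks for it with `is`. The names END and last_item are made up for illustration, and the generator is repeated here in a form that assumes Python 3 (next(iterator) rather than iterator.next()).

```python
END = object()  # hypothetical sentinel, chosen by the caller

def lookahead(iterable, null_item=None):
    iterator = iter(iterable)
    try:
        prev = next(iterator)
    except StopIteration:
        return  # empty input: yield nothing
    for item in iterator:
        yield prev, item
        prev = item
    yield prev, null_item

def last_item(iterable):
    """Return the final item, using the sentinel to detect the end."""
    for item, ahead in lookahead(iterable, null_item=END):
        if ahead is END:  # identity check, so a real None in the data is not mistaken for the end
            return item
    raise ValueError("empty iterable")

pairs = list(lookahead([1, None], null_item=END))
# pairs[0] is (1, None): the None here is real data.
# pairs[1] is (None, END): END marks the true end of the sequence.
```

With the default null_item, the same input gives (1, None) followed by (None, None), and the two Nones cannot be told apart.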


The usual way to do this for a list a is

from itertools import izip_longest
for token, lookahead in izip_longest(a, a[1:]):
    pass

For the last token, you will get None as look-ahead token.

If you want to avoid the copy of the list introduced by a[1:], you can use islice(a, 1, None) instead. For a slight modification working for arbitrary iterables, see the answer by gnibbler. For a simple, easy to grasp generator function also working for arbitrary iterables, see the answer by senderle.
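
A small sketch of the islice variant, assuming Python 3 (where izip_longest is named zip_longest). The helper name pairs_with_lookahead is hypothetical.

```python
from itertools import islice, zip_longest

def pairs_with_lookahead(a):
    # `a` must be re-iterable (list, tuple, str): it is iterated twice,
    # once directly and once through islice, which skips the first item.
    return zip_longest(a, islice(a, 1, None))

for token, ahead in pairs_with_lookahead(['a', 'b', 'c']):
    print(token, ahead)
```

Unlike a[1:], islice does not copy the list, but it also does not make this safe for one-shot iterators, since both arguments would then draw from the same underlying stream.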


You might find the answer to your question here: Using lookahead with generators.


I consider all these answers incorrect, because they will cause unforeseen bugs if your list contains None. Here is my take:

SEQUENCE_END = object()

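def lookahead(iterable):
    iterator = iter(iterable)  # don't shadow the built-in iter()
    try:
        current = next(iterator)
    except StopIteration:
        return  # empty input: yield nothing
    for ahead in iterator:
        yield current, ahead
        current = ahead
    yield current, SEQUENCE_END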

Example:

>>> for x, ahead in lookahead(range(3)):
...     print(x, ahead)
...
0 1
1 2
2 <object object at 0x...>

Example of how this answer is better:

def containsDoubleElements(seq):
    """
        Returns whether seq contains double elements, e.g. [1,2,2,3]
    """
    return any(val==nextVal for val,nextVal in lookahead(seq))

>>> containsDoubleElements([None])
False  # correct!

def containsDoubleElements_BAD(seq):
    """
        Returns whether seq contains double elements, e.g. [1,2,2,3]
    """
    return any(val==nextVal for val,nextVal in lookahead_OTHERANSWERS(seq))

>>> containsDoubleElements_BAD([None])
True  # incorrect!