
python -- callable iterator size?

I am searching a text file for a certain string using re.finditer(pattern, text).

I would like to know when this returns nothing, meaning that it found no match in the passed text.

I know that callable iterators have next() and __iter__.

I would like to know whether I can get the iterator's size, or otherwise find out that no string matched my pattern.


This solution uses less memory because, unlike the other solutions that build a list, it does not store intermediate results:

sum(1 for _ in re.finditer(pattern, text))

The older solutions consume a lot of memory if the pattern matches very frequently in the text, as with the pattern '[a-z]'.

Test case:

pattern = 'a'
text = 10240000 * 'a'

This solution with sum(1 for ...) uses approximately only the memory for the text itself, that is, len(text) bytes. The previous list-based solutions can use approximately 58 or 110 times more memory than necessary: 580 MB on 32-bit and 1.1 GB on 64-bit Python 2.7, respectively.
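
As a minimal sketch of the counting approach above, written for Python 3: the helper name count_matches is hypothetical (it is not part of the re module), and a count of zero is what signals that nothing matched.

```python
import re

def count_matches(pattern, text):
    """Count non-overlapping matches without building a list of them."""
    # The generator yields one 1 per match; sum() consumes them one at a time.
    return sum(1 for _ in re.finditer(pattern, text))

print(count_matches('a', 'banana'))  # 3
print(count_matches('z', 'banana'))  # 0, meaning no match was found
```

Note that counting consumes the iterator, so re.finditer has to be called again if the matches themselves are needed afterwards.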


EDIT 3: The answer by @hynekcer is much much better than this.

EDIT 2: This will not work if you have an infinite iterator, or one that consumes too many gigabytes of RAM or disk space (in 2010, 1 gigabyte is still a large amount).

You have already seen a good answer, but here is an expensive hack that you can use if you want to eat a cake and have it too :) The trick is that we have to clone the cake, and when you are done eating, we put it back into the same box. Remember, when you iterate over the iterator, it usually becomes empty, or at least loses previously returned values.

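>>> def getIterLength(iterator):
    temp = list(iterator)
    result = len(temp)
    # Note: this only rebinds the local name; the caller's object is unchanged,
    # so a one-shot iterator passed in is left exhausted.
    iterator = iter(temp)
    return result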

>>>
>>> f = xrange(20)
>>> f
xrange(20)
>>> 
>>> x = getIterLength(f)
>>> x
20
>>> f
xrange(20)
>>> 

EDIT: Here is a safer version, but using it still requires some discipline. It does not feel quite Pythonic. You would get the best solution if you posted the whole relevant code sample that you are trying to implement.

>>> def getIterLenAndIter(iterator):
    temp = list(iterator)
    return len(temp), iter(temp)

>>> f = iter([1,2,3,7,8,9])
>>> f
<listiterator object at 0x02782890>
>>> l, f = getIterLenAndIter(f)
>>> 
>>> l
6
>>> f
<listiterator object at 0x02782610>
>>> 


No, sorry: iterators are not meant to know their length. They only know what comes next, which makes them very efficient at going through collections. Although they are faster, they do not allow indexing, which includes knowing the length of a collection.
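
A small sketch of what this means in practice, assuming Python 3: calling len() on the object returned by re.finditer raises TypeError, because the iterator does not define __len__. The variable names here are illustrative only.

```python
import re

matches = re.finditer('a', 'banana')

# The iterator exposes no length at all.
has_len = hasattr(matches, '__len__')
print(has_len)  # False

try:
    len(matches)
    len_failed = False
except TypeError as exc:
    # len() is rejected outright; nothing has been consumed yet.
    len_failed = True
    print("len() is not supported:", exc)
```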


You can get the number of elements in an iterator by doing:

len( [m for m in re.finditer(pattern, text) ] )

Iterators are iterators because they have not generated the sequence yet. The code above extracts each item from the iterator into a list until the iterator is exhausted, then takes the length of that list. Something more memory efficient would be:

count = 0
for item in re.finditer(pattern, text):
    count += 1

A trickier alternative to the for loop is to use reduce to count the items in the iterator one by one. This is effectively the same thing as the for loop:

reduce( (lambda x, y : x + 1), myiterator, 0)

This ignores the y passed into reduce and just adds one to x. The running sum is initialized to 0.
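
In Python 3, reduce is no longer a builtin and has to be imported from functools. A minimal sketch of the same idea under that assumption follows; the helper name count_with_reduce is hypothetical.

```python
import re
from functools import reduce

def count_with_reduce(iterable):
    """Count items by folding over the iterable, ignoring each item's value."""
    # total is the running count; the item itself is never used.
    return reduce(lambda total, _item: total + 1, iterable, 0)

print(count_with_reduce(re.finditer('an', 'banana')))  # 2
```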


While some iterators might be able to know their length (for example, those created from a string or a list), most do not and cannot. re.finditer is a good example of one that cannot know its length until it is finished.

However, there are a couple of different ways to improve your current code:

  • use re.search to find if there are any matches, then use re.finditer to do the actual processing; or

  • use a sentinel value with the for loop.

The second option looks something like:

match = empty = object()
for match in re.finditer(...):
    pass  # do some stuff
if match is empty:
    pass  # there were no matches
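
The first option, checking with re.search before iterating, could be sketched roughly as follows in Python 3. The function name match_positions and its return convention (None when nothing matches) are assumptions made for illustration only.

```python
import re

def match_positions(pattern, text):
    """Return the start offsets of all matches, or None if there are none."""
    # re.search stops at the first match, so the emptiness check is cheap.
    if re.search(pattern, text) is None:
        return None
    # Only now run the full scan and do the actual processing.
    return [m.start() for m in re.finditer(pattern, text)]

print(match_positions('a', 'banana'))  # [1, 3, 5]
print(match_positions('z', 'banana'))  # None
```

The trade-off is that the text up to the first match is scanned twice, once by re.search and once by re.finditer.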


A quick solution would be to turn your iterator into a list and check the length of that list, but doing so can be bad for memory if there are too many results.

matches = list(re.finditer(pattern,text))
if matches:
  do_something()
print("Found",len(matches),"matches")
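
If only the "were there any matches?" question matters, a different technique avoids building the list: pull the first item with next() and a default, then chain it back in front of the rest. This is a sketch for Python 3, and the helper name peek is hypothetical.

```python
import re
from itertools import chain

def peek(iterator):
    """Return (is_empty, iterator) without losing the first item."""
    sentinel = object()
    first = next(iterator, sentinel)
    if first is sentinel:
        # Nothing was produced: hand back an empty iterator.
        return True, iter(())
    # Put the consumed item back in front of the remaining ones.
    return False, chain([first], iterator)

is_empty, matches = peek(re.finditer('a', 'banana'))
if not is_empty:
    print([m.start() for m in matches])  # [1, 3, 5]
```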