开发者

zip() alternative for iterating through two iterables

I have two large (~100 GB) text files that must be iterated through simultaneously.

Zip works well for smaller files but I found out that it's actually making a list of lines from my two files. This means that every line gets stored in memory. I don't need to do anything with the lines more than once.

handle1 = open('filea', 'r'); handle2 = open('fileb', 'r')

for i, j in zip(handle1, handle2):
    do something with i and j.
    write to an output file.
    no need to do anything with i and j after t开发者_如何学Chis.

Is there an alternative to zip() that acts as a generator that will allow me to iterate through these two files without using >200GB of ram?


itertools has a function izip that does that

from itertools import izip
for i, j in izip(handle1, handle2):
    ...

If the files are of different sizes you may use izip_longest, as izip will stop at the smaller file.


You can use izip_longest like this to pad the shorter file with empty lines

in python 2.6

from itertools import izip_longest
with handle1 as open('filea', 'r'):
    with handle2 as open('fileb', 'r'): 
        for i, j in izip_longest(handle1, handle2, fillvalue=""):
            ...

or in Python 3+

from itertools import zip_longest
with handle1 as open('filea', 'r'), handle2 as open('fileb', 'r'): 
    for i, j in zip_longest(handle1, handle2, fillvalue=""):
        ...


If you want to truncate to the shortest file:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

try:
    while 1:
        i = handle1.next()
        j = handle2.next()

        do something with i and j.
        write to an output file.

except StopIteration:
    pass

finally:
    handle1.close()
    handle2.close()

Else

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

i_ended = False
j_ended = False
while 1:
    try:
        i = handle1.next()
    except StopIteration:
        i_ended = True
    try:
        j = handle2.next()
    except StopIteration:
        j_ended = True

        do something with i and j.
        write to an output file.
    if i_ended and j_ended:
        break

handle1.close()
handle2.close()

Or

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

while 1:
    i = handle1.readline()
    j = handle2.readline()

    do something with i and j.
    write to an output file.

    if not i and not j:
        break
handle1.close()
handle2.close()


Something like this? Wordy, but it seems to be what you're asking for.

It can be adjusted to do things like a proper merge to match keys between the two files, which is often more what's needed than the simplistic zip function. Also, this doesn't truncate, which is what the SQL OUTER JOIN algorithm does, again, different from what zip does and more typical of files.

with open("file1","r") as file1:
    with open( "file2", "r" as file2:
        for line1, line2 in parallel( file1, file2 ):
            process lines

def parallel( file1, file2 ):
    if1_more, if2_more = True, True
    while if1_more or if2_more:
        line1, line2 = None, None # Assume simplistic zip-style matching
        # If you're going to compare keys, then you'd do that before
        # deciding what to read.
        if if1_more:
            try:
                line1= file1.next()
            except StopIteration:
                if1_more= False
        if if2_more:
            try:
                line2= file2.next()
            except StopIteration:
                if2_more= False
        yield line1, line2
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜