zip() alternative for iterating through two iterables

2022-12-21 05:53 Q&A author:

I have two large (~100 GB) text files that must be iterated through simultaneously.

zip() works well for smaller files, but I found out that it actually builds a list of lines from my two files, so every line gets stored in memory. I don't need to do anything with the lines more than once.

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

for i, j in zip(handle1, handle2):
    # do something with i and j
    # write to an output file
    # no need to do anything with i and j after this
    pass

Is there an alternative to zip() that acts as a generator and lets me iterate through these two files without using more than 200 GB of RAM?

In Python 2, itertools has a function izip that does that (in Python 3, the built-in zip() is already lazy):

from itertools import izip
for i, j in izip(handle1, handle2):
    ...

If the files have different numbers of lines, you may use izip_longest, as izip will stop at the end of the shorter file.
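As a minimal sketch of the whole loop in Python 3, where the built-in zip() returns an iterator just as izip does in Python 2, the following pairs up lines from two files and writes one combined line per pair. The function name paste_files and the tab-joined output format are assumptions made for illustration; they are not part of the question.

```python
def paste_files(path_a, path_b, path_out):
    """Write line k of path_a and line k of path_b, tab-separated, to path_out.

    Hypothetical helper: the name and the output format are illustrative.
    Only one line from each input is held in memory at a time.
    Returns the number of line pairs written.
    """
    count = 0
    with open(path_a, 'r') as handle1, \
         open(path_b, 'r') as handle2, \
         open(path_out, 'w') as out:
        # In Python 3, zip() yields pairs lazily and stops at the shorter file.
        for i, j in zip(handle1, handle2):
            out.write(i.rstrip('\n') + '\t' + j.rstrip('\n') + '\n')
            count += 1
    return count
```

Opening all three files in one with statement means they are closed even if the loop body raises.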

You can use izip_longest like this to pad the shorter file with empty lines.

In Python 2.6:

from itertools import izip_longest
with open('filea', 'r') as handle1:
    with open('fileb', 'r') as handle2:
        for i, j in izip_longest(handle1, handle2, fillvalue=""):
            ...

Or in Python 3:

from itertools import zip_longest
with open('filea', 'r') as handle1, open('fileb', 'r') as handle2:
    for i, j in zip_longest(handle1, handle2, fillvalue=""):
        ...
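To see the padding behaviour without touching real files, here is a small sketch that uses io.StringIO objects as stand-ins for the two handles (Python 3). The sample contents are made up for illustration, and list() is used only because the inputs are tiny; with real 100 GB files you would iterate directly.

```python
import io
from itertools import zip_longest

handle1 = io.StringIO("a1\na2\na3\n")  # stand-in for the longer file
handle2 = io.StringIO("b1\n")          # stand-in for the shorter file

# zip_longest keeps going until both inputs are exhausted,
# substituting fillvalue for the input that ran out first.
pairs = list(zip_longest(handle1, handle2, fillvalue=""))

# Rewind and compare with plain zip(), which stops at the shorter input.
handle1.seek(0)
handle2.seek(0)
truncated = list(zip(handle1, handle2))
```

Here pairs holds three tuples, the last two with an empty string in the second position, while truncated holds only the first pair.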

If you want to truncate to the shorter file:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

try:
    while 1:
        # next() works in Python 2.6+ and Python 3
        i = next(handle1)
        j = next(handle2)

        # do something with i and j
        # write to an output file

except StopIteration:
    pass

finally:
    handle1.close()
    handle2.close()

Otherwise, to keep going until both files are exhausted:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

i_ended = False
j_ended = False
while 1:
    try:
        i = next(handle1)
    except StopIteration:
        i = ''  # pad the exhausted file with empty lines
        i_ended = True
    try:
        j = next(handle2)
    except StopIteration:
        j = ''
        j_ended = True

    if i_ended and j_ended:
        break

    # do something with i and j
    # write to an output file

handle1.close()
handle2.close()

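The same loop using readline(), which returns an empty string at end of file:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

while 1:
    i = handle1.readline()
    j = handle2.readline()

    # stop before processing once both files are exhausted
    if not i and not j:
        break

    # do something with i and j
    # write to an output file

handle1.close()
handle2.close()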

Something like this? Wordy, but it seems to be what you're asking for.

It can be adjusted to do things like a proper merge that matches keys between the two files, which is often closer to what's needed than the simplistic zip function. Also, this doesn't truncate: like a SQL OUTER JOIN, it keeps going until both inputs are exhausted, which differs from what zip does and is more typical of file processing.

def parallel(file1, file2):
    if1_more, if2_more = True, True
    while if1_more or if2_more:
        line1, line2 = None, None  # Assume simplistic zip-style matching
        # If you're going to compare keys, then you'd do that before
        # deciding what to read.
        if if1_more:
            try:
                line1 = next(file1)
            except StopIteration:
                if1_more = False
        if if2_more:
            try:
                line2 = next(file2)
            except StopIteration:
                if2_more = False
        if if1_more or if2_more:
            # Skip the final (None, None) pair once both files are exhausted.
            yield line1, line2

with open("file1", "r") as file1:
    with open("file2", "r") as file2:
        for line1, line2 in parallel(file1, file2):
            pass  # process lines
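The key-matching merge mentioned above could look roughly like the sketch below. It is not the answer's own code: it assumes both inputs are sorted ascending by key, that keys are unique within each input, and that the key is the first tab-separated field of each line. The name outer_join and the default key function are hypothetical.

```python
def _first_field(line):
    # Assumed key: the first tab-separated field, ignoring the newline.
    return line.rstrip('\n').split('\t', 1)[0]


def outer_join(lines1, lines2, key=_first_field):
    """Yield (line1, line2) pairs from two inputs sorted by key.

    Hypothetical sketch of a sorted-merge outer join. When a key appears
    in only one input, the other side of the pair is None.
    """
    it1, it2 = iter(lines1), iter(lines2)
    line1, line2 = next(it1, None), next(it2, None)
    while line1 is not None or line2 is not None:
        if line2 is None or (line1 is not None and key(line1) < key(line2)):
            # Key present only in the first input.
            yield line1, None
            line1 = next(it1, None)
        elif line1 is None or key(line2) < key(line1):
            # Key present only in the second input.
            yield None, line2
            line2 = next(it2, None)
        else:
            # Same key on both sides: emit the matched pair.
            yield line1, line2
            line1, line2 = next(it1, None), next(it2, None)
```

Because it only ever holds one pending line per input, this can be driven by two open file handles in the same way as parallel().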
