How to Compare 2 very large matrices using Python

2023-01-17 12:57 Q&A author:

I have an interesting problem.

I have a very large CSV file (over 300 MB, more than 10,000,000 rows) containing time series data points. Every month I get a new CSV file that is almost the same as the previous one, except that a few lines have been added and/or removed and perhaps a couple of lines have been modified.

I want to use Python to compare the 2 files and identify which lines have been added, removed and modified.

The issue is that the file is very large, so I need a solution that can handle the large file size and run within a reasonable time; the faster the better.

Example of what a file and its new file might look like:

Old file

A,2008-01-01,23

A,2008-02-01,45

B,2008-01-01,56

B,2008-02-01,60

C,2008-01-01,3

C,2008-02-01,7

C,2008-03-01,9

etc...

New file

A,2008-01-01,23

A,2008-02-01,45

A,2008-03-01,67 (added)

B,2008-01-01,56

B,2008-03-01,33 (removed and added)

C,2008-01-01,3

C,2008-02-01,7

C,2008-03-01,22 (modified)

etc...

Basically the 2 files can be seen as matrices that need to be compared, and I have begun thinking of using PyTables. Any ideas on how to solve this problem would be greatly appreciated.

Like this.

Step 1. Sort.

Step 2. Read each file, doing line-by-line comparison. Write differences to another file.

You can easily write this yourself. Or you can use difflib. http://docs.python.org/library/difflib.html

Note that difflib's general solution is quite slow, as it searches for matching lines near a difference. Writing your own solution can run faster because you know how the files are supposed to match, so you can optimize that "resynch-after-a-diff" algorithm.
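
For a feel of what the difflib route gives you, here is a minimal sketch on a few already-sorted rows from the question. The variable names are only for illustration, and on 10,000,000 lines this is likely to be much slower than a hand-written comparison, as noted above.

```python
import difflib

# Sample rows taken from the question, already sorted.
old_lines = [
    "B,2008-01-01,56",
    "B,2008-02-01,60",
    "C,2008-03-01,9",
]
new_lines = [
    "B,2008-01-01,56",
    "B,2008-03-01,33",
    "C,2008-03-01,22",
]

# n=0 drops context lines so only changed lines are reported.
diff = difflib.unified_diff(old_lines, new_lines, lineterm="", n=0)

# Keep only the "+"/"-" lines, skipping the "---"/"+++" file headers
# and the "@@" hunk markers. (Assumes no data line itself starts with "--".)
changes = [
    line for line in diff
    if line[0] in "+-" and not line.startswith(("---", "+++"))
]
print(changes)
```

difflib only reports removed ("-") and added ("+") lines; deciding that a removed and an added line with the same first two columns is really a "modified" row is left to you.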

And 10,000,000 lines hardly matters. It's not that big. Two 300 MB files easily fit into memory.
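
If the files do fit in memory, Step 1 can be a plain in-memory sort. A minimal sketch, assuming the key is the first two columns and the dates are in ISO format as in the question; `sort_csv` is a hypothetical helper name, not something from a library. Bear in mind that the parsed rows will take noticeably more memory than the raw file size.

```python
import csv

def sort_csv(src_path, dst_path):
    """Write a copy of src_path sorted by its first two columns."""
    with open(src_path, newline="") as src:
        rows = [row for row in csv.reader(src) if row]  # skip blank lines
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    rows.sort(key=lambda row: (row[0], row[1]))
    with open(dst_path, "w", newline="") as dst:
        csv.writer(dst).writerows(rows)
```

Run it once per file, e.g. sort_csv('file1.csv', 'file1_sorted.csv'), and then do the line-by-line comparison of Step 2 on the sorted copies.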

This is a slightly naive implementation, but it will deal with unsorted data:

import csv

file1_dict = {}
file2_dict = {}

# Key each row by its first two columns; keep the rest as a tuple
with open('file1.csv', newline='') as handle:
    for row in csv.reader(handle):
        file1_dict[tuple(row[:2])] = tuple(row[2:])

with open('file2.csv', newline='') as handle:
    for row in csv.reader(handle):
        file2_dict[tuple(row[:2])] = tuple(row[2:])

with open('outfile.csv', 'w', newline='') as handle:
    writer = csv.writer(handle)
    for key, val in file1_dict.items():
        if key in file2_dict:
            # deal with keys that are in both
            if file2_dict[key] == val:
                writer.writerow(key + val + ('Same',))
            else:
                writer.writerow(key + file2_dict[key] + ('Modified',))
            file2_dict.pop(key)
        else:
            writer.writerow(key + val + ('Removed',))
    # deal with added keys: whatever is still left in file2_dict
    for key, val in file2_dict.items():
        writer.writerow(key + val + ('Added',))

You probably won't be able to "drop in" this solution, but it should get you ~95% of the way there. @S.Lott is right, two 300 MB files will easily fit in memory. If your files get into the 1-2 GB range, this may have to be modified to assume sorted data.

Something like this is close, although you may have to change the comparisons around for the added and modified cases to make sense:

# assuming both files are sorted by columns 1 and 2
import csv
import datetime

def str2date(text):
    return datetime.date(*map(int, text.split('-')))

def convert_tups(row):
    key = (row[0], str2date(row[1]))
    val = tuple(row[2:])
    return key, val

with open('file1.csv', newline='') as handle1, \
     open('file2.csv', newline='') as handle2, \
     open('outfile.csv', 'w', newline='') as outhandle:
    writer = csv.writer(outhandle)
    gen1 = map(convert_tups, csv.reader(handle1))
    gen2 = map(convert_tups, csv.reader(handle2))
    # next(..., None) returns None once a file is exhausted
    item1, item2 = next(gen1, None), next(gen2, None)
    while item1 is not None or item2 is not None:
        if item2 is None or (item1 is not None and item1[0] < item2[0]):
            # key only in the old file
            writer.writerow(item1[0] + item1[1] + ('Removed',))
            item1 = next(gen1, None)
        elif item1 is None or item1[0] > item2[0]:
            # key only in the new file
            writer.writerow(item2[0] + item2[1] + ('Added',))
            item2 = next(gen2, None)
        else:
            # same key in both files
            status = 'Same' if item1[1] == item2[1] else 'Modified'
            writer.writerow(item2[0] + item2[1] + (status,))
            item1, item2 = next(gen1, None), next(gen2, None)
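
The sorted comparison gives wrong results without any error if either file is not actually sorted, so it may be worth verifying that assumption first. A minimal sketch, with `check_sorted` as a hypothetical helper name; it assumes the key is the first two columns and that the dates are ISO-formatted, so they compare correctly as strings.

```python
import csv

def check_sorted(path):
    """Raise ValueError unless the (column 1, column 2) keys strictly increase."""
    previous = None
    with open(path, newline="") as handle:
        for row_no, row in enumerate(csv.reader(handle), start=1):
            if not row:
                continue  # skip blank lines
            key = (row[0], row[1])  # ISO date strings compare like dates
            if previous is not None and key <= previous:
                raise ValueError(f"{path}: key {key} out of order at row {row_no}")
            previous = key
```

Requiring strictly increasing keys also catches duplicate keys, which the comparison above does not expect.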
