
Python 2 lists compare optimization

Given: Two csv files (1.8 MB each): AllData_1, AllData_2. Each with ~8,000 lines. Each line consists of 8 columns. [txt_0,txt_1,txt_2,txt_3,txt_4,txt_5,txt_6,txt_7,txt_8]

Goal: Based on a match of txt_0 (or, AllData_1[0] == AllData_2 ), compare the contents of the next 4 columns for these individual rows. If the data is unequal, put the entire row for each set of data in a list based on the column being different and save the lists to an output file. If txt_0 is in one data set but not the other, then save that row directly to the output file.

Example:

AllData_1 row x contains: [a1, b2, c3, d4, e5, f6, g7, h8] AllData_2 row y contains: [a1, b2, c33c, d44d, e5, f6, g7, h8]

Program saves all of row x and y in lists corresponding to ListCol2 and ListCol3. After all comparing is finished, the lists are saved to file.

How can I make my code faster or change my code to a faster algorithm?

i = 0
x0list = []
y0list = []
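#separate lists; a chained assignment (a = b = []) would bind every name to one shared list
col1_diff, col2_diff, col3a_diff, col3b_diff, col4_diff = [], [], [], [], []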

#create list out of column 0
for y in AllData_2:
    y0list.append(y[0])

for entry in AllData_1:
    x0list.append(entry[0])
    if entry[0] not in y0list:
        pass  #code to save the line to file...

for y0 in AllData_2:
    if y0[0] not in x0list:
        pass  #code to save the line to file...

for yrow in AllData_2:
    i+=1

    for xrow in AllData_1:
        foundit = 0
        if yrow[0] == xrow[0] and foundit == 0 and (yrow[1] != xrow[1] or yrow[2] != xrow[2] or yrow[3] != xrow[3] or yrow[4] != xrow[4]):
            if yrow[1] != xrow[1]:
                col1_diff.append(yrow)
                col1_diff.append(xrow)
                foundit = 1

            elif yrow[2] != xrow[2]:
                col2_diff.append(yrow)
                col2_diff.append(xrow)
                foundit = 1

            elif len(yrow[3]) < len(xrow[3]):
                col3a_diff.append(yrow)
                col3a_diff.append(xrow)
                foundit = 1

            elif len(yrow[3]) >= len(xrow[3]):
                col3b_diff.append(yrow)
                col3b_diff.append(xrow)
                foundit = 1

            else:
                #col4 is actually a catch-all for any other differences between lines if [0]s are equal
                col4_diff.append(yrow)
                col4_diff.append(xrow)
                foundit = 1


Right off the top, you can make this a lot smaller.

y0list = []
for y in AllData_2:
    y0list.append(y[0])

is just a verbose way of saying

y0list = [y[0] for y in AllData_2]

You can also use Python's built-in sequence comparison. The expression below

(yrow[1] != xrow[1] or yrow[2] != xrow[2] or yrow[3] != xrow[3] or yrow[4] != xrow[4])

can be expressed as

yrow[1:5] != xrow[1:5]

which is much less prone to copy/paste errors.
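
As a quick check of the slice form, here is a small sketch using the two example rows from the question. The third row, zrow, is a made-up row added only for illustration. Note that the slice needs an upper bound of 5 to cover exactly columns 1 through 4; an open-ended slice also compares the remaining columns.

```python
xrow = ["a1", "b2", "c3", "d4", "e5", "f6", "g7", "h8"]
yrow = ["a1", "b2", "c33c", "d44d", "e5", "f6", "g7", "h8"]

# The explicit column-by-column test from the question.
long_form = (yrow[1] != xrow[1] or yrow[2] != xrow[2]
             or yrow[3] != xrow[3] or yrow[4] != xrow[4])

# The same test written as a slice comparison over columns 1..4.
short_form = yrow[1:5] != xrow[1:5]

# Hypothetical row that differs from xrow only in the last column.
zrow = xrow[:7] + ["changed"]
differs_in_cols_1_to_4 = zrow[1:5] != xrow[1:5]
differs_anywhere_after_col_0 = zrow[1:] != xrow[1:]
```

For zrow the bounded slice reports no difference, while the open-ended slice does, so the two are only interchangeable if every column after the first should take part in the comparison.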

To make it faster, you can avoid doing O(n**2) comparisons. Since you only care when the first column element is the same, you can just bundle them by first element.

index = {}
for yrow in AllData_2:
    key = yrow[0]
    rows = index.get(key)  # named rows to avoid shadowing the built-in list
    if rows is None:
        rows = []
        index[key] = rows
    rows.append(yrow)

for xrow in AllData_1:
    rows = index.get(xrow[0])
    if rows is None:
        continue
    for yrow in rows:
        pass  # Do all your comparison here
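
The same bundling can also be written with collections.defaultdict from the standard library, which creates the empty list on first access. This is only a sketch: the function names group_by_first_column and matching_pairs are hypothetical, and the rows are assumed to be lists already parsed from the csv files.

```python
from collections import defaultdict


def group_by_first_column(rows):
    """Map each column-0 value to the list of rows that carry it."""
    index = defaultdict(list)
    for row in rows:
        index[row[0]].append(row)
    return index


def matching_pairs(all_data_1, all_data_2):
    """Return (xrow, yrow) pairs whose first columns are equal."""
    index = group_by_first_column(all_data_2)
    pairs = []
    for xrow in all_data_1:
        # .get() does not insert missing keys, so the index is left unchanged.
        for yrow in index.get(xrow[0], []):
            pairs.append((xrow, yrow))
    return pairs
```

Building the index is one pass over AllData_2 and the lookup is one pass over AllData_1, so the work grows with the number of rows plus the number of matching pairs rather than with the product of the two file sizes.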


If you can expect no two lines in a given file to have the same data in column 0, you can significantly improve your code with a few dicts. Instead of the lines

x0list.append(entry[0])
y0list.append(y[0])

You would use:

x0dict[entry[0]] = entry
y0dict[y[0]] = y

after initializing x0dict and y0dict to {}. Then, instead of looping through both complete sets of data again, you can loop over just one of the dicts:

for x0, xrow in x0dict.items():
    if x0 in y0dict:
        yrow = y0dict[x0]
        # Do the col{1,2,3,4}_diff stuff here

As a bonus, the not in tests in your second and third loops work the same against the dicts, and they become hash lookups instead of scans through a list.
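
Putting the dict idea together, a minimal sketch could look like the following. It assumes column 0 is unique within each file, as stated above. The function name compare_by_key and its three return values are hypothetical, and the per-column classification from the question is deliberately left to the caller.

```python
def compare_by_key(all_data_1, all_data_2):
    """Split rows into: only in file 1, only in file 2, and differing pairs."""
    # Assumption: no two rows in the same file share a column-0 value.
    x0dict = dict((row[0], row) for row in all_data_1)
    y0dict = dict((row[0], row) for row in all_data_2)

    # Rows whose key is missing from the other file.
    only_in_1 = [row for row in all_data_1 if row[0] not in y0dict]
    only_in_2 = [row for row in all_data_2 if row[0] not in x0dict]

    # Rows with the same key whose columns 1..4 differ.
    differing = []
    for xrow in all_data_1:
        yrow = y0dict.get(xrow[0])
        if yrow is not None and yrow[1:5] != xrow[1:5]:
            # The col{1,2,3,4}_diff classification would go here.
            differing.append((xrow, yrow))
    return only_in_1, only_in_2, differing
```

Each row is looked up once, so two files of about 8,000 rows need on the order of 16,000 dict lookups rather than 64 million pairwise row comparisons.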


The line

(yrow[1] != xrow[1] or yrow[2] != xrow[2] or yrow[3] != xrow[3] or yrow[4] != xrow[4])

can be replaced with the nicer-looking

yrow[1:5] != xrow[1:5]

As your code stands right now, i is never used. If you do need that count, it is identical to i = len(AllData_2), since i is incremented exactly once per iteration of the loop over AllData_2.


Finally, your foundit variable currently serves no purpose. It is only used to control the flow with foundit == 0, immediately after setting it to 0, so that will always evaluate to True and setting it has no effect.
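
To illustrate the point about foundit, here is a small sketch with made-up rows. The first function mirrors the flag pattern from the question. The second shows one way to stop at the first match, assuming that was the intent behind the flag. Both function names are hypothetical.

```python
def matches_with_flag(rows, key):
    """Mirror the question's pattern: the flag is reset on every iteration."""
    matched = []
    for row in rows:
        foundit = 0
        if row[0] == key and foundit == 0:  # foundit == 0 is always True here
            matched.append(row)
            foundit = 1  # overwritten by the reset on the next iteration
    return matched


def first_match(rows, key):
    """Return the first row whose column 0 equals key, or None."""
    for row in rows:
        if row[0] == key:
            return row  # leaving the loop here is what the flag never did
    return None
```

With duplicate keys in the input, matches_with_flag still collects every matching row, which shows that the flag never suppresses anything.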
