Comparing files in directory to each other with no repeated comparisons
What I want to do is create a list of files to compare in a directory of N files. The end goal is to compare images to find duplicates regardless of the format. Suppose the directory contains the files 1.jpg, 2.jpg and 3.jpg.
Using this
import os
import sys

def main(argv):
    # Both lists hold the same directory listing.
    list1 = os.listdir(argv[0])
    list2 = os.listdir(argv[0])
    file_compare_list = []
    # Pair every file with every file, including itself.
    for pic1 in list1:
        for pic2 in list2:
            file_compare_list.append([pic1, pic2])
    print(file_compare_list)

if __name__ == "__main__":
    main(sys.argv[1:])
I get a list like this
[['1.jpg', '1.jpg'], #0
['1.jpg', '2.jpg'], #1
['1.jpg', '3.jpg'], #2
['2.jpg', '1.jpg'], #3
['2.jpg', '2.jpg'], #4
['2.jpg', '3.jpg'], #5
['3.jpg', '1.jpg'], #6
['3.jpg', '2.jpg'], #7
['3.jpg', '3.jpg']] #8
Now I could go through the list and be assured that each file will be compared, but there are obvious duplicates. Indexes 0, 4, and 8 are easy to take care of: I can compare the two file names and drop the pair when they match. What I am more concerned with is pairs like indexes 2 and 6, which are the same comparison in reverse order, so doing both would repeat work. Any help with this would be greatly appreciated.
You need itertools.combinations. This code prints exactly what you need:
import itertools
import os

files = os.listdir("/path/to/files")
# Each unordered pair of files is produced exactly once.
for file1, file2 in itertools.combinations(files, 2):
    print(file1, file2)
And some theory behind it: http://en.wikipedia.org/wiki/Combination
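To tie this back to the duplicate hunt in the question, here is a minimal sketch of running a comparison over each pair. The helper name find_identical_pairs is made up for illustration, and filecmp.cmp only checks that two files are byte-for-byte equal. That is an assumption standing in for a real image comparison: finding duplicates regardless of format would need the images decoded first.

```python
import filecmp
import itertools
import os


def find_identical_pairs(directory):
    """Return (path1, path2) pairs whose contents are byte-for-byte equal.

    Hypothetical helper: byte equality is only a stand-in for a real
    image comparison, which would have to decode the pixels.
    """
    # Sort so the pair order is reproducible; os.listdir order is arbitrary.
    paths = sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    # combinations yields each unordered pair exactly once, never (a, a).
    return [
        (a, b)
        for a, b in itertools.combinations(paths, 2)
        if filecmp.cmp(a, b, shallow=False)
    ]
```

Calling find_identical_pairs("/path/to/files") would return a list of path tuples; swapping filecmp.cmp for your own comparison function keeps the pairing logic unchanged.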
There is always itertools.combinations:
import itertools
my_list = ['1.jpg', '2.jpg', '3.jpg']
my_combinations = list(itertools.combinations(my_list, 2))
my_combinations will be:
[('1.jpg', '2.jpg'), ('1.jpg', '3.jpg'), ('2.jpg', '3.jpg')]
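As a quick sanity check on how much work this saves, here is a small sketch comparing the number of pairs from the nested loops in the question with the number from combinations. The function name count_pairs is just an illustrative helper; it computes n(n-1)/2, the number of ways to choose 2 items out of n.

```python
import itertools


def count_pairs(n):
    # Number of unordered pairs of distinct items among n items.
    return n * (n - 1) // 2


names = ['1.jpg', '2.jpg', '3.jpg']

# Every ordered pair, self-pairs included (what the nested loops build).
all_ordered = list(itertools.product(names, repeat=2))
# Every unordered pair of distinct files.
pairs = list(itertools.combinations(names, 2))

print(len(all_ordered))  # 9
print(len(pairs))        # 3
```

For N files that is N*N comparisons versus N*(N-1)/2, so a bit less than half.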
How's this for a hint?
Instead of computing all off-diagonal elements of the comparison matrix P x P:
P = {A, B, C, D, ...}
+ A + B + C + D + ...
A | | * | * | * | ...
B | * | | * | * | ...
C | * | * | | * | ...
D | * | * | * | | ...
| | | | |
you can compute either the upper triangle:
+ A + B + C + D + ...
A | | * | * | * | ...
B | | | * | * | ...
C | | | | * | ...
D | | | | | ...
| | | | |
or the lower triangle:
+ A + B + C + D + ...
A | | | | | ...
B | * | | | | ...
C | * | * | | | ...
D | * | * | * | | ...
| | | | |
(from this answer of mine)
Apologies if that was too obtuse. Some actual code:
>>> items = ['a', 'b', 'c', 'd', 'e']
>>> pairs = [[x, y] for i, x in enumerate(items) for y in items[i+1:]]
>>> print(pairs)
[['a', 'b'], ['a', 'c'], ['a', 'd'], ['a', 'e'], ['b', 'c'], ['b', 'd'], ['b', 'e'], ['c', 'd'], ['c', 'e'], ['d', 'e']]
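If the list is long, the slices above copy part of it on every step. A possible alternative, sketched here with a made-up function name upper_triangle_pairs, walks the upper triangle by index and yields pairs lazily. It assumes the input supports len() and indexing.

```python
def upper_triangle_pairs(items):
    """Yield (items[i], items[j]) for every i < j, one pair at a time."""
    n = len(items)
    for i in range(n):
        # Start at i + 1 to skip the diagonal and the lower triangle.
        for j in range(i + 1, n):
            yield items[i], items[j]


for left, right in upper_triangle_pairs(['a', 'b', 'c']):
    print(left, right)
```

This visits the same cells as the upper-triangle diagram, in the same row-by-row order.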
Check out what this does and adapt to your problem:
[(x, y) for x in a for y in a if x < y]
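In case it helps with the adapting, here is a small sketch of how that filter relates to itertools.combinations. The file names are invented for the example. The filter picks pairs by value, so every pair comes out with the smaller name first, and it relies on the names being unique, which holds for files in a single directory. It still loops over all N*N combinations of names before discarding most of them.

```python
import itertools

a = ['3.jpg', '1.jpg', '2.jpg']  # unsorted on purpose

# Keep a pair only when the first name sorts before the second.
filtered = [(x, y) for x in a for y in a if x < y]
# Pairs chosen by position in the list instead of by value.
by_position = list(itertools.combinations(a, 2))

# Same unordered pairs either way; only the order inside each pair differs.
same_pairs = {frozenset(p) for p in filtered} == {frozenset(p) for p in by_position}
print(same_pairs)  # True
```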