What is the fastest way to compare two list of items?
I have two folders with approximately 10,000 files each. I'd like to write a script or program that can tell me if these folders are in sync and then tell me which files are missing from each to make them in sync.
Therefore, after generating a list of files, what is the fastest algorithm to sort them for unique files? What I'm thinking right now is comparing t开发者_开发技巧he first file on each list then if they are different remove one until they are the same, then remove both from the list (because they are not unique.)
Is there a faster algorithm than this?
diff -s [path1] [path2]
If you're in C, use qsort() to sort the file lists in ascending order, then use a kind of "merge:
Have two pointers starting at the beginning of each list. Do the following:
- if the names are same, then this name exists in both lists - advance both pointers
- if the name in list1 > name in list2, then list 2 is the only one that has it - advance list2's pointer
- otherwise the name in list1 is in list1 only - advance list1's pointer
- repeat
When you're at the end of one of the lists, all the elements left in the other are obviously missing from the first one.
Alternatively, you can combine both lists while keeping track what list each element comes from. Then sort the combined list. Scan the sorted list. If you see two instances of same value, then it was in both lists. Otherwise you'll know which list it comes from.
Also, another approach you can follow is
If there is no constraint on space, I would go for putting the files of one folder in a hash .. It will take O(N) time and some space..! then I will take each file of the second folder and check if the key exists in the first hash .. this is again O(1) time operation ..! problem solved in O(N) time.. but this is big on space requirement ..
repeat the same in reverse oder depends if u want speed or space..!
Generate md5 or sha1 checksums and compare those. Something like this
cd dir1; md5sum * | sort > /tmp/hash1
cd dir2; md5sum * | sort > /tmp/hash2
diff /tmp/hash1 /tmp/hash2 # could also use comm
If you're only worried about the names, not about the contents of the files, then diff dir1 dir2
works fine.
If you need this information just to synchronize them, you can do the comparison and copying in a single pass:
- Get a directory listing from both directories
- sort both listings lexicographically
- loop simultaneously through both lists:
- if one of the lists is empty, stop the loop
- if both elements are the same: step both indices
- else take the lexicographically lower element, copy it over, and step only this index
- copy any remaining elements of the non-empty list, if existent
If you want to do it in two passes, or need the information what gets copied where, substitute "copy over" with "put name and direction into result list".
精彩评论