Comparing two files for identical lines where the order doesn't matter
I have two files (which could be up to 150,000 lines long; each line is 160 bytes), which I'd like to check to see if the lines in each are the same. diff
won't work for me (directly) because a small percentage of the lines occur in a different order in the two files. Typically, a pair of lines will be transposed.
Although it's a slightly expensive way to do it (for anything larger I'd rethink this), I'd fire up Python and do the following:
filename1 = "WHATEVER YOUR FILENAME IS"
filename2 = "WHATEVER THE OTHER ONE IS"
# Read each file into a set of lines. Sets ignore order (and duplicates).
# Stripping the newline keeps a missing final newline from counting as a difference.
with open(filename1) as f1, open(filename2) as f2:
    file1contents = set(line.rstrip("\n") for line in f1)
    file2contents = set(line.rstrip("\n") for line in f2)
if file1contents == file2contents:
    print("Yup they're the same!")
else:
    print("Nope, they differ. In file2, not file1:\n\n")
    for diffLine in file2contents - file1contents:
        print("\t", diffLine)
    print("\n\nIn file1, not file2:\n\n")
    for diffLine in file1contents - file2contents:
        print("\t", diffLine)
That'll print the lines that appear in only one of the two files.
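One caveat with plain sets: they collapse duplicates, so a line that occurs twice in one file and once in the other would go unnoticed. If that matters for your data, a variation using collections.Counter keeps the counts. This is only a sketch; the function names (count_lines, surplus_lines, compare_files) are made up for illustration.

```python
from collections import Counter

def count_lines(lines):
    """Map each line (trailing newline removed) to the number of times it occurs."""
    return Counter(line.rstrip("\n") for line in lines)

def surplus_lines(lines1, lines2):
    """Return (extra_in_1, extra_in_2): lines one side has more copies of than the other."""
    counts1 = count_lines(lines1)
    counts2 = count_lines(lines2)
    # Counter subtraction keeps only the positive differences.
    return counts1 - counts2, counts2 - counts1

def compare_files(path1, path2):
    """Apply surplus_lines to two files given by path (hypothetical helper)."""
    with open(path1) as f1, open(path2) as f2:
        return surplus_lines(f1, f2)
```

Both results are empty exactly when the two files hold the same lines the same number of times, whatever the order.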
For only 150k lines, just hash each line and store them ordered in a lookup table. Then for each line in file two just perform the lookup.
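A rough sketch of that idea, taking "ordered lookup table" to mean a sorted list of digests searched with bisect (that reading is my assumption, as are the helper names and the choice of SHA-256). It checks one direction only, so run it both ways, and it does not count duplicates.

```python
import bisect
import hashlib

def line_digest(line):
    """Hash one line (trailing newline removed) to a fixed-size digest."""
    return hashlib.sha256(line.rstrip("\n").encode("utf-8")).digest()

def build_lookup(lines):
    """Return the digests of all lines as a sorted list, used as the lookup table."""
    return sorted(line_digest(line) for line in lines)

def contains(table, line):
    """Binary-search the sorted table for the digest of `line`."""
    digest = line_digest(line)
    i = bisect.bisect_left(table, digest)
    return i < len(table) and table[i] == digest

def lines_missing_from(lines1, lines2):
    """Return the lines of lines2 whose digest does not occur among lines1."""
    table = build_lookup(lines1)
    return [line for line in lines2 if not contains(table, line)]
```

At 150k lines of 160 bytes the table is only a few megabytes, and each lookup is a binary search, so the whole comparison is cheap. A plain set or dict of the lines would do the same job with less code; the sorted table just mirrors the description above.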
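Another Python script to do this:
#!/usr/bin/env python3
import sys
file1 = sys.argv[1]
file2 = sys.argv[2]
# Read both files and sort their lines so that ordering no longer matters.
with open(file1) as f1, open(file2) as f2:
    lines1 = sorted(line.rstrip('\n') for line in f1)
    lines2 = sorted(line.rstrip('\n') for line in f2)
s = ''
# Compare the sorted lines pairwise; zip stops at the end of the shorter file.
for line1, line2 in zip(lines1, lines2):
    if line1 != line2:
        print('> %s' % line1)
        print('< %s' % line2)
        s = 'not '
# A different number of lines also means the files differ.
if len(lines1) != len(lines2):
    s = 'not '
print('file %s is %slike file %s' % (file1, s, file2))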
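One thing to watch with a pairwise comparison of sorted lines: a single missing line shifts everything after it, so every later pair gets reported as a mismatch. A possible variation is to hand the sorted lines to difflib from the standard library, which realigns after an insertion or deletion. The function name sorted_diff is just for illustration.

```python
import difflib

def sorted_diff(lines1, lines2, name1="file1", name2="file2"):
    """Return unified-diff output lines for the sorted contents of two line lists."""
    a = sorted(line.rstrip("\n") for line in lines1)
    b = sorted(line.rstrip("\n") for line in lines2)
    # unified_diff yields nothing at all when the two sequences are equal.
    return list(difflib.unified_diff(a, b, fromfile=name1, tofile=name2, lineterm=""))
```

An empty result means the files contain the same lines, duplicates included. Otherwise, lines prefixed with - occur only in the first file and lines prefixed with + only in the second.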