How do I get UNIX diff to ignore duplicate lines in different positions?
I have two CSV files of about 134 MB.
All I want to do is get the 'diff' of the two files, except the position of a line doesn't matter.
In other words, let's say I have:
abc,123
def,456
and
def,456
ghi,789
I don't want to be told about def,456. It's in a different position in the second file, but I want it to be counted as not being different.
Just doing diff file1 file2 > outputfile isn't working. What command should I use to do this? I know this is trivial in PHP, but I run out of memory quickly. I'd rather just use UNIX command line tools. Diff may not even be the right utility for this.
I would propose that you sort the two input files and then compare the two sorted versions, something like this:
sort file1 > sorted_1
sort file2 > sorted_2
diff sorted_1 sorted_2
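If the goal is only to list lines that appear in one file but not the other, comm on the sorted files may be a closer fit than diff. The following is a minimal sketch, not a tested recipe for the 134 MB files: it uses the sample data from the question, and the temporary directory and variable names are made up for illustration. LC_ALL=C is set so that sort and comm agree on the ordering.

```shell
# Sample data from the question, written to a hypothetical temp directory.
workdir=$(mktemp -d)
printf 'abc,123\ndef,456\n' > "$workdir/file1"
printf 'def,456\nghi,789\n' > "$workdir/file2"

# comm expects sorted input; use the same collation for sort and comm.
LC_ALL=C sort "$workdir/file1" > "$workdir/sorted_1"
LC_ALL=C sort "$workdir/file2" > "$workdir/sorted_2"

# -23 keeps only lines unique to the first file,
# -13 keeps only lines unique to the second file.
only_in_1=$(LC_ALL=C comm -23 "$workdir/sorted_1" "$workdir/sorted_2")
only_in_2=$(LC_ALL=C comm -13 "$workdir/sorted_1" "$workdir/sorted_2")

printf 'only in file1: %s\n' "$only_in_1"
printf 'only in file2: %s\n' "$only_in_2"

rm -rf "$workdir"
```

The shared line def,456 is not reported, regardless of its position in either file. If repeated lines within a single file should also be collapsed, sort -u can be used in place of sort.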
Sorry, but identifying differences like that is exactly what diff does. I think what you want is a tool that identifies:
1
2
3
and:
3
1
2
as being the same. There is no tool I know of that does this (but I might add it to my http://code.google.com/p/csvfix/ tool at some point).
What you currently need to do is sort both files and then diff them.
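As a rough illustration of the sort-then-diff approach on the 1, 2, 3 example above, here is a small sketch. The file names and the temporary directory are hypothetical; it relies only on diff returning exit status 0 when its inputs are identical.

```shell
# The two orderings from the example, written to a hypothetical temp directory.
workdir=$(mktemp -d)
printf '1\n2\n3\n' > "$workdir/a"
printf '3\n1\n2\n' > "$workdir/b"

# Sorting removes the positional differences.
sort "$workdir/a" > "$workdir/a.sorted"
sort "$workdir/b" > "$workdir/b.sorted"

# diff exits with status 0 when the sorted files are identical.
if diff "$workdir/a.sorted" "$workdir/b.sorted" > /dev/null; then
    result=same
else
    result=different
fi
echo "$result"

rm -rf "$workdir"
```

Running diff on the unsorted files would report changes, while the sorted copies compare as identical.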