
How do I get UNIX diff to ignore duplicate lines in different positions?

I have two CSV files of about 134 MB.

All I want to do is get the 'diff' of the two files, except the position of a line doesn't matter.

In other words, let's say I have:

abc,123
def,456

and

def,456
ghi,789

I don't want to be told about def,456. It's in a different position in the second file, but I want it to be counted as not being different.

Just doing diff file1 file2 > outputfile isn't working. What command should I use to do this? I know this is trivial in PHP, but I run out of memory quickly. I'd rather just use UNIX command line tools. Diff may not even be the right utility for this.


I would propose that you do a sort on the two input files and then compare the two sorted versions, something like this:

sort file1 > sorted_1
sort file2 > sorted_2

diff sorted_1 sorted_2
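
As a rough, self-contained sketch of the same idea, here it is applied to the two sample files from the question. The temporary directory and the variable names are only illustrative, and LC_ALL=C is an assumption added so both files are sorted with the same byte-wise ordering:

```shell
# Recreate the two sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf 'abc,123\ndef,456\n' > "$workdir/file1"
printf 'def,456\nghi,789\n' > "$workdir/file2"

# Sort both files with the same fixed locale so the ordering is consistent.
LC_ALL=C sort "$workdir/file1" > "$workdir/sorted_1"
LC_ALL=C sort "$workdir/file2" > "$workdir/sorted_2"

# diff exits with status 1 when the files differ; "|| true" keeps that
# from being treated as a failure in scripts that use "set -e".
diff_output=$(diff "$workdir/sorted_1" "$workdir/sorted_2" || true)
printf '%s\n' "$diff_output"
```

With these inputs, only abc,123 and ghi,789 should be reported; def,456 lines up in both sorted files and is left out.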


Sorry, but identifying differences like that is exactly what diff does. I think what you want is a tool that identifies:

1
2
3

and:

3
1
2

as being the same. There is no tool I know of that does this (but I might add it to my http://code.google.com/p/csvfix/ tool at some point).

What you currently need to do is sort both files and then diff them.
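
Once both files are sorted, the standard comm utility might be worth a look as an alternative to diff, since it lists the lines unique to each file without any positional markers. This is only a sketch using the sample data from the question; the file and variable names are made up for illustration, and LC_ALL=C is assumed so that sort and comm agree on the ordering:

```shell
# Recreate and sort the two sample files from the question.
workdir=$(mktemp -d)
printf 'abc,123\ndef,456\n' > "$workdir/file1"
printf 'def,456\nghi,789\n' > "$workdir/file2"
LC_ALL=C sort "$workdir/file1" > "$workdir/sorted_1"
LC_ALL=C sort "$workdir/file2" > "$workdir/sorted_2"

# comm expects sorted input. -23 suppresses columns 2 and 3, leaving
# lines found only in the first file; -13 leaves lines found only in
# the second file.
only_in_1=$(LC_ALL=C comm -23 "$workdir/sorted_1" "$workdir/sorted_2")
only_in_2=$(LC_ALL=C comm -13 "$workdir/sorted_1" "$workdir/sorted_2")

echo "only in file1: $only_in_1"
echo "only in file2: $only_in_2"
```

Here def,456 appears in both sorted files, so it should show up in neither list.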
