Finding Set Complement in Unix
Given these two files:
$ cat A.txt    $ cat B.txt
3              11
5              1
1              12
2              3
4              2
I want to find the lines that are in A but not in B. What is the Unix command for that?
I tried this, but it seems to fail:
comm -3 <(sort -n A.txt) <(sort -n B.txt) | sed 's/\t//g'
comm -2 -3 <(sort A.txt) <(sort B.txt)
should do what you want, if I understood you correctly.
Edit: Actually, comm needs the files to be sorted in lexicographical order, so you don't want -n in your sort command:
$ cat A.txt
1
4
112
$ cat B.txt
1
112
# Bad:
$ comm -2 -3 <(sort -n A.txt) <(sort -n B.txt)
4
comm: file 1 is not in sorted order
112
# OK:
$ comm -2 -3 <(sort A.txt) <(sort B.txt)
4
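To see why -n trips up comm, it may help to sort the same three lines both ways. This is only a small sketch; LC_ALL=C is my own addition to keep the ordering predictable across locales, and the variable names are made up for the demo.

```shell
# The same three lines as A.txt above, sorted two ways.
lexical=$(printf '1\n4\n112\n' | LC_ALL=C sort)      # character-by-character order
numeric=$(printf '1\n4\n112\n' | LC_ALL=C sort -n)   # numeric order

echo "lexical order (what comm expects):"
echo "$lexical"
echo "numeric order (sort -n):"
echo "$numeric"
```

In lexical order 112 comes before 4, because the character "1" sorts before "4". With sort -n it comes after, which is the point where comm complains that the file is not in sorted order.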
You can try this:
$ awk 'FNR==NR{a[$0];next} (!($0 in a))' B.txt A.txt
5
4
Note that the awk solution works, but it retains duplicate lines from A (when those lines aren't in B); the python solution de-dupes the result.
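If you want the awk approach to drop those duplicates as well, one possible variant is sketched below. The function name set_diff_awk and the array names in_b and seen are hypothetical, chosen just for this example.

```shell
# set_diff_awk A B: print lines of A that are absent from B, first occurrence only.
# (set_diff_awk is a hypothetical helper name, not a standard command.)
set_diff_awk() {
    awk '
        FNR == NR { in_b[$0]; next }     # first file (B): remember every line
        !($0 in in_b) && !seen[$0]++     # second file (A): print new lines not in B
    ' "$2" "$1"
}

# Small demo with throwaway files.
demo=$(mktemp -d)
printf '5\n4\n5\n1\n' > "$demo/A.txt"
printf '1\n' > "$demo/B.txt"
set_diff_awk "$demo/A.txt" "$demo/B.txt"   # prints 5 then 4
rm -rf "$demo"
```

Like the original one-liner, this keeps the order of A and needs no sorting. One caveat with the FNR==NR idiom: if B is empty, awk never leaves the "first file" branch, so nothing is printed.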
Also note that comm doesn't compute a true set difference; if a line is repeated in A, and repeated fewer times in B, comm will leave the "extra" line(s) in the result:
$ cat A.txt
120
121
122
122
$ cat B.txt
121
122
121
$ comm -23 <(sort A.txt) <(sort B.txt)
120
122
If this behavior is undesired, use sort -u to remove duplicates (only the dupes in A matter):
$ comm -23 <(sort -u A.txt) <(sort B.txt)
120
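Putting the pieces together, here is a rough sketch of a reusable wrapper. The name set_diff is hypothetical, and forcing LC_ALL=C is my assumption so that sort and comm agree on the ordering. It avoids process substitution, so it should also work in a plain POSIX sh.

```shell
# set_diff A B: print each distinct line of A that does not occur in B.
# (set_diff is a hypothetical helper name.)
set_diff() {
    sorted_b=$(mktemp) || return 1
    # Sort both inputs in the C locale so sort and comm agree on the order.
    LC_ALL=C sort -u "$2" > "$sorted_b"
    # "-" tells comm to read the first file from standard input.
    LC_ALL=C sort -u "$1" | LC_ALL=C comm -23 - "$sorted_b"
    rm -f "$sorted_b"
}

# Demo with the duplicate-heavy files from the example above.
demo=$(mktemp -d)
printf '120\n121\n122\n122\n' > "$demo/A.txt"
printf '121\n122\n121\n' > "$demo/B.txt"
set_diff "$demo/A.txt" "$demo/B.txt"   # prints 120
rm -rf "$demo"
```

The output comes back in sorted order, not in the original order of A.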
I recently wrote a program called Setdown that performs set operations from the CLI.
It can perform set operations by writing a definition similar to what you would write in a Makefile:
someUnion: "file-1.txt" \/ "file-2.txt"
someIntersection: "file-1.txt" /\ "file-2.txt"
someDifference: someUnion - someIntersection
I personally don't recommend performing set operations with ad-hoc commands that were not built for the job. They won't work well when you need to do many set operations, or when some of your set operations depend on each other. Setdown, by contrast, lets you write set operations that depend on other set operations!
At any rate, I think that it's pretty cool and you should totally check it out.
Note: I think that Setdown is much better than comm simply because Setdown does not require you to sort your inputs correctly. Instead, Setdown sorts your inputs for you, and it uses an external sort, so it can handle massive files. I consider this a major benefit because I have forgotten to sort the files that I passed into comm more times than I can count.
Here is another way to do it, with join:
join -v1 <(sort A.txt) <(sort B.txt)
From the documentation on join:
‘-v file-number’ Print a line for each unpairable line in file file-number (either ‘1’ or ‘2’), instead of the normal output.
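One thing to watch with join: by default it matches on the first whitespace-delimited field, not on the whole line. That makes no difference for single-word lines like the ones in the question, but it does for lines containing spaces. A possible workaround, sketched here, is to pass -t a separator that never appears in the data so each line counts as one field. The helper name whole_line_join_diff and the choice of the \001 byte are assumptions for illustration.

```shell
# whole_line_join_diff A B: lines of A with no identical line in B, via join.
# (whole_line_join_diff is a hypothetical helper name.)
whole_line_join_diff() {
    sep=$(printf '\001')             # assumed never to occur in the data
    sorted_b=$(mktemp) || return 1
    LC_ALL=C sort "$2" > "$sorted_b"
    # -t "$sep" makes each whole line a single join field;
    # -v 1 prints the lines of the first input that found no partner.
    LC_ALL=C sort "$1" | LC_ALL=C join -t "$sep" -v 1 - "$sorted_b"
    rm -f "$sorted_b"
}
```

For example, with "a x" and "b y" in A and "a z" in B, plain join -v1 reports only "b y", because "a x" and "a z" share the first field. The wrapper above reports both lines of A.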