
Finding Set Complement in Unix

Given these two files:

 $ cat A.txt     $ cat B.txt
    3           11
    5           1
    1           12
    2           3
    4           2

I want to find the lines (numbers) that are in A but not in B. What is the Unix command for that?

I tried this, but it seems to fail:

comm -3 <(sort -n A.txt) <(sort -n B.txt) | sed 's/\t//g' 


comm -2 -3 <(sort A.txt) <(sort B.txt)

should do what you want, if I understood you correctly.

Edit: Actually, comm needs the files to be sorted in lexicographical order, so you don't want -n in your sort command:

$ cat A.txt
1
4
112
$ cat B.txt
1
112
# Bad:
$ comm -2 -3 <(sort -n A.txt) <(sort -n B.txt)
4
comm: file 1 is not in sorted order
112
# OK:
$ comm -2 -3 <(sort A.txt) <(sort B.txt)
4


You can try this:

$ awk 'FNR==NR{a[$0];next} (!($0 in a))' B.txt A.txt
5
4


Note that the awk solution works, but it retains duplicate lines from A (those that aren't in B); the Python solution de-dupes the result.
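If the duplicates should go as well, one possible tweak (a sketch, not part of the original answer) adds a second array so that each surviving line is printed only once. The function name `set_diff_awk` and the array names `b` and `seen` are made up for illustration.

```shell
# Hypothetical helper: print the lines of file $1 that are absent from file $2,
# each at most once, in the order they first appear in $1.
set_diff_awk() {
    # First pass (FNR == NR) reads $2 and records every line as a key of b.
    # Second pass reads $1 and prints a line only if it is not a key of b
    # and has not been printed before (seen[$0] is still 0).
    awk 'FNR == NR { b[$0]; next } !($0 in b) && !seen[$0]++' "$2" "$1"
}
```

Called as `set_diff_awk A.txt B.txt`, this keeps the input order of A, unlike the sort-based approaches. Like the original one-liner, it relies on `FNR == NR` to detect the first file, so it prints nothing if B is empty.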

Also note that comm doesn't compute a true set difference: if a line is repeated in A and appears fewer times in B, comm leaves the "extra" line(s) in the result:

$ cat A.txt 
120
121
122
122
$ cat B.txt 
121
122
121
$ comm -23 <(sort A.txt) <(sort B.txt)
120
122

If this behavior is undesired, use sort -u to remove duplicates (only the dupes in A matter):

$ comm -23 <(sort -u A.txt) <(sort B.txt)
120
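The two caveats above (lexicographic sort order and duplicates) can be folded into a small wrapper. This is only a sketch: the name `set_diff` is invented here, and it pins `LC_ALL=C` on the assumption that plain byte-wise ordering is acceptable, so that `sort` and `comm` agree on collation. It avoids process substitution, so it should also run under a plain `sh` that has `mktemp`.

```shell
# Hypothetical helper: print the lines of file $1 that do not occur in file $2.
set_diff() {
    tmp=$(mktemp) || return 1
    # Sort both inputs the same way comm compares them (byte order under LC_ALL=C).
    # -u drops duplicates, so the result is a true set difference.
    LC_ALL=C sort -u "$2" > "$tmp"
    # "-" makes comm read the sorted copy of $1 from standard input;
    # -23 suppresses columns 2 and 3, leaving only the lines unique to $1.
    LC_ALL=C sort -u "$1" | LC_ALL=C comm -23 - "$tmp"
    rm -f "$tmp"
}
```

With the A.txt and B.txt shown in this answer, `set_diff A.txt B.txt` prints only 120.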


I recently wrote a program called Setdown that performs set operations from the command line.

You describe the set operations by writing definitions similar to what you would write in a Makefile:

someUnion: "file-1.txt" \/ "file-2.txt"
someIntersection: "file-1.txt" /\ "file-2.txt"
someDifference: someUnion - someIntersection

I think it's pretty cool, and you should check it out. I personally don't recommend performing set operations with ad-hoc commands that were not built for the job. They won't work well when you need to do many set operations or when some set operations depend on each other, whereas Setdown lets you write set operations that depend on other set operations.

Note: I think Setdown is much better than comm simply because Setdown does not require you to sort your inputs correctly. Instead, Setdown sorts your inputs for you, and it uses an external sort, so it can handle massive files. I consider this a major benefit, because I have lost count of the times I forgot to sort the files that I passed to comm.


Here is another way to do it with join:

join -v1 <(sort A.txt) <(sort B.txt)

From the documentation on join:

‘-v file-number’ Print a line for each unpairable line in file file-number (either ‘1’ or ‘2’), instead of the normal output.
