extracting unique values between 2 sets/files
Working in linux/shell env, how can I accomplish the following:
text file 1 contains:
1
2
3
4
5
text file 2 contains:
6
7
1
2
3
4
I need to extract the entries in file 2 which are not in file 1. So '6' and '7' in this example.
How do I do this from the command line?
many thanks!
$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
6
7
Explanation of how the code works:
- If we're working on file1, track each line of text we see.
- If we're working on file2, and have not seen the line text, then print it.
Explanation of details:
- FNR is the current file's record number
- NR is the current overall record number from all input files
- FNR==NR is true only when we are reading file1
- $0 is the current line of text
- a[$0] is a hash with the key set to the current line of text
- a[$0]++ tracks that we've seen the current line of text
- !($0 in a) is true only when we have not seen the line text
- Print the line of text if the above pattern returns true; this is the default awk behavior when no explicit action is given
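To see why FNR==NR holds only while file1 is being read, a quick probe against the sample files from the question prints both counters for every input line; FNR restarts at 1 when awk moves on to file2, while NR keeps counting:
$ awk '{print FILENAME, "FNR=" FNR, "NR=" NR}' file1 file2
file1 FNR=1 NR=1
...
file1 FNR=5 NR=5
file2 FNR=1 NR=6
...
file2 FNR=6 NR=11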
Using some lesser-known utilities:
sort file1 > file1.sorted
sort file2 > file2.sorted
comm -1 -3 file1.sorted file2.sorted
This will output duplicates, so if 1 appears 3 times in file2 but only 2 times in file1, the extra occurrence of 1 will still be output. If this is not what you want, pipe the output of sort through uniq before writing it to a file:
sort file1 | uniq > file1.sorted
sort file2 | uniq > file2.sorted
comm -1 -3 file1.sorted file2.sorted
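In bash, the same comparison works without the intermediate .sorted files by sorting on the fly with process substitution (sort -u here stands in for the sort | uniq pipeline above); this is also the form used in the benchmarks further down:
comm -1 -3 <(sort -u file1) <(sort -u file2)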
There are lots of utilities in the GNU coreutils package that allow for all sorts of text manipulations.
I was wondering which of the following solutions was the "fastest" for "larger" files:
awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2 # awk1 by SiegeX
awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2 # awk2 by ghostdog74
comm -13 <(sort file1) <(sort file2)
join -v 2 <(sort file1) <(sort file2)
grep -v -F -x -f file1 file2
Results of my benchmarks in short:
- Do not use grep -Fxf, it's much slower (2-4 times in my tests).
- comm is slightly faster than join.
- If file1 and file2 are already sorted, comm and join are much faster than awk1 + awk2. (Of course, they do not assume sorted files.)
- awk1 + awk2 supposedly use more RAM and less CPU. Real run times are lower for comm, probably due to the fact that it uses more threads. CPU times are lower for awk1 + awk2.
For the sake of brevity I omit full details. However, I assume that anyone interested can contact me or just repeat the tests. Roughly, the setup was
# Debian Squeeze, Bash 4.1.5, LC_ALL=C, slow 4 core CPU
$ wc file1 file2
321599 321599 8098710 file1
321603 321603 8098794 file2
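The timings below came from simple time runs; a minimal way to repeat them (my reconstruction, not the exact benchmark commands) would be:
export LC_ALL=C
time awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2 > /dev/null
time comm -13 <(sort file1) <(sort file2) > /dev/null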
Typical results of fastest runs
awk2: real 0m1.145s user 0m1.088s sys 0m0.056s user+sys 1.144
awk1: real 0m1.369s user 0m1.324s sys 0m0.044s user+sys 1.368
comm: real 0m0.980s user 0m1.608s sys 0m0.184s user+sys 1.792
join: real 0m1.080s user 0m1.756s sys 0m0.140s user+sys 1.896
grep: real 0m4.005s user 0m3.844s sys 0m0.160s user+sys 4.004
BTW, for the awkies: it seems that a[$0]=1 is faster than a[$0]++, and (!($0 in a)) is faster than (!a[$0]). So, for an awk solution I suggest:
awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2
How about:
diff file_1 file_2 | grep '^>' | cut -c 3-
This will print the entries in file_2 which are not in file_1. For the opposite result, just replace '>' with '<'. 'cut' removes the first two characters added by 'diff', which are not part of the original content.
The files don't even need to be sorted.
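For example, the reverse direction (entries in file_1 that are not in file_2) would be:
diff file_1 file_2 | grep '^<' | cut -c 3-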
with grep:
grep -F -x -v -f file_1 file_2
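The flags, briefly:
- -f file_1 reads the patterns from file_1, one per line
- -F treats them as fixed strings rather than regular expressions
- -x matches whole lines only
- -v inverts the match, so only the lines of file_2 that match none of the patterns are printed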
here's another awk solution
$ awk 'FNR==NR{a[$0]++;next}(!($0 in a))' file1 file2
6
7
$ cat file1 file1 file2 | sort | uniq -u
6
7
uniq -- report or filter out repeated lines in a file
... Repeated lines in the input will not be detected if they are not adjacent, so it may be necessary to sort the files first.
-u Only output lines that are not repeated in the input.
Print file1 twice to make sure all entries from file1 are skipped by uniq -u.
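To see why file1 is listed twice: with only one copy of it, a value that appears in file1 but not in file2 would also survive uniq -u. With the sample files from the question (5 appears only in file1):
$ cat file1 file2 | sort | uniq -u
5
6
7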
cat file1 file2 | sort -u > unique
If you are really set on doing this from the command line, this site (search for "no duplicates found") has an awk
example that searches for duplicates. That may be a good starting point.
However, I'd encourage you to use Perl or Python for this. In Python, the flow of the program would be:
def find_unique_values(file1, file2):
    # Read every value from file1 into a list
    with open(file1) as f:
        contents1 = [line.strip() for line in f]
    # For each value in file2, print it if it never appears in file1
    with open(file2) as f:
        for value2 in (line.strip() for line in f):
            found = False
            for value1 in contents1:
                if value2 == value1:
                    found = True
            if not found:
                print(value2)
This isn't the most elegant way of doing it, since it has O(n^2) time complexity, but it will do the job. (Loading the file1 values into a hash/set first would make it roughly linear.)