awk Merge two files based on common field and print similarities and differences

2023-01-31 19:49 问答作者：

I have two files I would like to merge into a third but I need to see both when they share a common field and where they differ.Since there are minor differences in other fields, I cannot use a diff tool and I thought this could be done with awk.

File 1:

aWonderfulMachine             1   mlqsjflk          
Anoth开发者_如何学编程erWonderfulMachine     2   mlksjf          
YetAnother WonderfulMachine 3   sdg         
TrashWeWon'tBuy             4   jhfgjh          
MoreTrash                     5   qsfqf         
MiscelleneousStuff           6  qfsdf           
MoreMiscelleneousStuff       7  qsfwsf

File2:

aWonderfulMachine             22    dfhdhg          
aWonderfulMachine             23    dfhh            
aWonderfulMachine             24    qdgfqf          
AnotherWonderfulMachine     25    qsfsq         
AnotherWonderfulMachine     26    qfwdsf            
MoreDifferentStuff           27    qsfsdf           
StrangeStuffBought           28    qsfsdf

Desired output:

aWonderfulMachine   1   mlqsjflk    aWonderfulMachine   22  dfhdhg
                                     aWonderfulMachine  23  dfhdhg
                                     aWonderfulMachine  24  dfhh
AnotherWonderfulMachine 2   mlksjf  AnotherWonderfulMachine 25  qfwdsf
                                       AnotherWonderfulMachine  26  qfwdsf
File1
YetAnother WonderfulMachine 3   sdg         
TrashWeWon'tBuy             4   jhfgjh          
MoreTrash                     5   qsfqf         
MiscelleneousStuff           6   qfsdf          
MoreMiscelleneousStuff       7   qsfwsf         
File2                   
MoreDifferentStuff          27  qsfsdf          
StrangeStuffBought          28  qsfsdf

I have tried a few awks scripts here and there, but they are either based on two fields only, and I don't know how to modify the output, or they delete the duplicates based on two fields only, etc (I am new to this and awk syntax is tough). Thank you much in advance for your help.

You can come very close using these three commands:

join <(sort file1) <(sort file2)
join -v 1 <(sort file1) <(sort file2)
join -v 2 <(sort file1) <(sort file2)

This assumes a shell, such as Bash, that supports process substitution (<()). If you're using a shell that doesn't, the files would need to be pre-sorted.

To do this in AWK:

#!/usr/bin/awk -f
BEGIN { FS="\t"; flag=1; file1=ARGV[1]; file2=ARGV[2] }
FNR == NR { lines1[$1] = $0; count1[$1]++; next }  # process the first file
{   # process the second file and do output
    lines2[$1] = $0;
    count2[$1]++;
    if ($1 != prev) { flag = 1 };
    if (count1[$1]) {
        if (flag) printf "%s ", lines1[$1];
        else printf "\t\t\t\t\t"
        flag = 0;
        printf "\t%s\n", $0
    }
    prev = $1
}
END { # output lines that are unique to one file or the other
    print "File 1: " file1
    for (i in lines1) if (! (i in lines2)) print lines1[i]
    print "File 2: " file2
    for (i in lines2) if (! (i in lines1)) print lines2[i]
}

To run it:

$ ./script.awk file1 file2

The lines won't be output in the same order that they appear in the input files. The second input file (file2) needs to be sorted since the script assumes that similar lines are adjacent. You will probably want to adjust the tabs or other spacing in the script. I haven't done much in that regard.

One way to do it (albeit with hardcoded file names):

BEGIN {
    FS="\t"; 
    readfile(ARGV[1], s1); 
    readfile(ARGV[2], s2); 
    ARGV[1] = ARGV[2] = "/dev/null"
}
END{
    for (k in s1) {
    if ( s2[k] ) printpair(k,s1,s2);
    }
    print "file1:"
    for (k in s1) {
    if ( !s2[k] ) print s1[k];
    }
    print "file2:"
    for (k in s2) {
    if ( !s1[k] ) print s2[k];
    }
}
function readfile(fname, sary) {
    while ( getline <fname ) {
    key = $1;
    if (sary[key]) {
        sary[key] = sary[key] "\n" $0; 
    } else {
        sary[key] = $0;
    };
    }
    close(fname);
}
function printpair(key, s1, s2) {
    n1 = split(s1[key],l1,"\n");
    n2 = split(s2[key],l2,"\n");
    for (i=1; i<=max(n1,n2); i++){
    if (i==1) {
        b = l1[1]; 
        gsub("."," ",b);
    }
    if (i<=n1) { f1 = l1[i] } else { f1 = b };
    if (i<=n2) { f2 = l2[i] } else { f2 = b };
    printf("%s\t%s\n",f1,f2);
    }
}
function max(x,y){ z = x; if (y>x) z = y; return z; }

Not particularly elegant, but it handles many-to-many cases.

继续阅读：merge

awk Merge two files based on common field and print similarities and differences

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？