
awk Merge two files based on common field and print similarities and differences

I have two files that I would like to merge into a third, but I need to see both where they share a common field and where they differ. Since there are minor differences in the other fields, I cannot use a diff tool, and I thought this could be done with awk.

File 1:

aWonderfulMachine             1   mlqsjflk          
AnotherWonderfulMachine     2   mlksjf          
YetAnother WonderfulMachine 3   sdg         
TrashWeWon'tBuy             4   jhfgjh          
MoreTrash                     5   qsfqf         
MiscelleneousStuff           6  qfsdf           
MoreMiscelleneousStuff       7  qsfwsf

File2:

aWonderfulMachine             22    dfhdhg          
aWonderfulMachine             23    dfhh            
aWonderfulMachine             24    qdgfqf          
AnotherWonderfulMachine     25    qsfsq         
AnotherWonderfulMachine     26    qfwdsf            
MoreDifferentStuff           27    qsfsdf           
StrangeStuffBought           28    qsfsdf

Desired output:

aWonderfulMachine   1   mlqsjflk    aWonderfulMachine   22  dfhdhg
                                     aWonderfulMachine  23  dfhh
                                     aWonderfulMachine  24  qdgfqf
AnotherWonderfulMachine 2   mlksjf  AnotherWonderfulMachine 25  qsfsq
                                       AnotherWonderfulMachine  26  qfwdsf
File1
YetAnother WonderfulMachine 3   sdg         
TrashWeWon'tBuy             4   jhfgjh          
MoreTrash                     5   qsfqf         
MiscelleneousStuff           6   qfsdf          
MoreMiscelleneousStuff       7   qsfwsf         
File2                   
MoreDifferentStuff          27  qsfsdf          
StrangeStuffBought          28  qsfsdf  

I have tried a few awk scripts from here and there, but they either match on two fields only, and I don't know how to modify their output, or they delete duplicates based on two fields only, and so on. (I am new to this, and awk syntax is tough.) Thank you very much in advance for your help.


You can come very close using these three commands:

join <(sort file1) <(sort file2)
join -v 1 <(sort file1) <(sort file2)
join -v 2 <(sort file1) <(sort file2)

This assumes a shell, such as Bash, that supports process substitution (<()). If you're using a shell that doesn't, the files would need to be pre-sorted.
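
For a shell without process substitution, the pre-sorting step could look like the sketch below. The sample data and the names file1.sorted and file2.sorted are placeholders, not part of the question. Keep in mind that join splits on blanks by default, so it matches on the first blank-separated word of each line, and that a key occurring several times in one file is paired with every matching line in the other.

```shell
#!/bin/sh
# Work in a scratch directory so the sample files do not clobber anything.
cd "$(mktemp -d)" || exit 1

# Hypothetical sample data: three blank-separated columns, key in column 1.
printf 'b 2 y\na 1 x\nc 3 z\n' > file1
printf 'a 22 p\nd 28 s\na 23 q\n' > file2

# join needs both inputs sorted on the join field.
# LC_ALL=C keeps sort and join agreeing on the collation order.
LC_ALL=C sort -k1,1 file1 > file1.sorted
LC_ALL=C sort -k1,1 file2 > file2.sorted

common=$(LC_ALL=C join file1.sorted file2.sorted)       # keys present in both files
only1=$(LC_ALL=C join -v 1 file1.sorted file2.sorted)   # lines found only in file1
only2=$(LC_ALL=C join -v 2 file1.sorted file2.sorted)   # lines found only in file2

printf '%s\nFile1\n%s\nFile2\n%s\n' "$common" "$only1" "$only2"
```

With this sample, the key a from file1 is printed once for each of its two matches in file2, which is close to, but not exactly, the layout asked for in the question.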

To do this in AWK:

#!/usr/bin/awk -f
BEGIN { FS="\t"; flag=1; file1=ARGV[1]; file2=ARGV[2] }
FNR == NR { lines1[$1] = $0; count1[$1]++; next }  # process the first file
{   # process the second file and do output
    lines2[$1] = $0;
    count2[$1]++;
    if ($1 != prev) { flag = 1 };
    if (count1[$1]) {
        if (flag) printf "%s ", lines1[$1];
        else printf "\t\t\t\t\t"
        flag = 0;
        printf "\t%s\n", $0
    }
    prev = $1
}
END { # output lines that are unique to one file or the other
    print "File 1: " file1
    for (i in lines1) if (! (i in lines2)) print lines1[i]
    print "File 2: " file2
    for (i in lines2) if (! (i in lines1)) print lines2[i]
}

To run it:

$ ./script.awk file1 file2

The lines won't be output in the same order that they appear in the input files. The second input file (file2) needs to be sorted since the script assumes that similar lines are adjacent. You will probably want to adjust the tabs or other spacing in the script. I haven't done much in that regard.
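
The FNR == NR test in the script is what separates the two passes: NR counts records across all input files, while FNR restarts at 1 for each file, so the two are equal only while the first file is being read. Here is a minimal sketch of that idiom on its own, using made-up tab-separated data (the names a.tsv and b.tsv are placeholders). One caveat: if the first file is empty, the test also holds for the second file, so the idiom misfires.

```shell
#!/bin/sh
cd "$(mktemp -d)" || exit 1

# Hypothetical tab-separated sample files, key in column 1.
printf 'a\t1\nb\t2\n' > a.tsv
printf 'a\t9\nc\t8\na\t7\n' > b.tsv

matched=$(awk -F'\t' '
    FNR == NR { first[$1] = $0; next }        # first file: remember each line by its key
    $1 in first { print first[$1] "\t" $0 }   # second file: print only lines whose key was seen
' a.tsv b.tsv)

printf '%s\n' "$matched"
```

As in the script above, a key that occurs more than once in the first file keeps only its last line, because each assignment overwrites the previous one.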


One way to do it (the two file names are taken from the command line):

BEGIN {
    FS="\t"; 
    readfile(ARGV[1], s1); 
    readfile(ARGV[2], s2); 
    ARGV[1] = ARGV[2] = "/dev/null"
}
END {
    for (k in s1) {
        if ( s2[k] ) printpair(k,s1,s2);
    }
    print "file1:"
    for (k in s1) {
        if ( !s2[k] ) print s1[k];
    }
    print "file2:"
    for (k in s2) {
        if ( !s1[k] ) print s2[k];
    }
}
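function readfile(fname, sary) {
    # compare against 0 so that an unreadable file (getline returns -1) ends the loop
    while ( (getline < fname) > 0 ) {
        key = $1;
        if (sary[key]) {
            sary[key] = sary[key] "\n" $0;   # same key again: append as another line
        } else {
            sary[key] = $0;
        }
    }
    close(fname);
}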
function printpair(key, s1, s2) {
    n1 = split(s1[key],l1,"\n");
    n2 = split(s2[key],l2,"\n");
    for (i=1; i<=max(n1,n2); i++){
        if (i==1) {
            b = l1[1];
            gsub("."," ",b);
        }
        if (i<=n1) { f1 = l1[i] } else { f1 = b };
        if (i<=n2) { f2 = l2[i] } else { f2 = b };
        printf("%s\t%s\n",f1,f2);
    }
}
function max(x,y){ z = x; if (y>x) z = y; return z; }

Not particularly elegant, but it handles many-to-many cases.
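
The blank filler in printpair comes from copying the first line and replacing every character with a space, which gives a pad with the same number of characters. A small sketch of that trick in isolation, on a made-up input line:

```shell
#!/bin/sh
# Replace every character of the line with a space; brackets make the pad visible.
padded=$(printf 'abc 1\n' | awk '{ b = $0; gsub(/./, " ", b); printf "[%s]", b }')
printf '%s\n' "$padded"
```

A tab in the input also becomes a single space, so with tab-separated files the padded column may not line up exactly with the lines above it.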
