awk: Merge two files based on a common field and print similarities and differences
I have two files I would like to merge into a third, but I need to see both where they share a common field and where they differ. Since there are minor differences in the other fields, I cannot use a diff tool, and I thought this could be done with awk.
File 1:
aWonderfulMachine 1 mlqsjflk
AnotherWonderfulMachine 2 mlksjf
YetAnother WonderfulMachine 3 sdg
TrashWeWon'tBuy 4 jhfgjh
MoreTrash 5 qsfqf
MiscelleneousStuff 6 qfsdf
MoreMiscelleneousStuff 7 qsfwsf
File 2:
aWonderfulMachine 22 dfhdhg
aWonderfulMachine 23 dfhh
aWonderfulMachine 24 qdgfqf
AnotherWonderfulMachine 25 qsfsq
AnotherWonderfulMachine 26 qfwdsf
MoreDifferentStuff 27 qsfsdf
StrangeStuffBought 28 qsfsdf
Desired output:
aWonderfulMachine 1 mlqsjflk aWonderfulMachine 22 dfhdhg
aWonderfulMachine 23 dfhh
aWonderfulMachine 24 qdgfqf
AnotherWonderfulMachine 2 mlksjf AnotherWonderfulMachine 25 qsfsq
AnotherWonderfulMachine 26 qfwdsf
File1
YetAnother WonderfulMachine 3 sdg
TrashWeWon'tBuy 4 jhfgjh
MoreTrash 5 qsfqf
MiscelleneousStuff 6 qfsdf
MoreMiscelleneousStuff 7 qsfwsf
File2
MoreDifferentStuff 27 qsfsdf
StrangeStuffBought 28 qsfsdf
I have tried a few awk scripts here and there, but they are either based on two fields only and I don't know how to modify the output, or they delete the duplicates based on two fields only, etc. (I am new to this and the awk syntax is tough.) Thank you very much in advance for your help.
You can come very close using these three commands:
join <(sort file1) <(sort file2)
join -v 1 <(sort file1) <(sort file2)
join -v 2 <(sort file1) <(sort file2)
This assumes a shell, such as Bash, that supports process substitution (<( )). If you're using a shell that doesn't, the files would need to be pre-sorted.
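For example, a minimal pre-sorted sketch (the file1.sorted and file2.sorted names are just placeholders):
sort file1 > file1.sorted
sort file2 > file2.sorted
join file1.sorted file2.sorted        # lines whose first field appears in both files
join -v 1 file1.sorted file2.sorted   # lines only in file1
join -v 2 file1.sorted file2.sorted   # lines only in file2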
To do this in AWK:
#!/usr/bin/awk -f
BEGIN { FS = "\t"; flag = 1; file1 = ARGV[1]; file2 = ARGV[2] }   # fields are assumed to be tab-separated

FNR == NR { lines1[$1] = $0; count1[$1]++; next }   # process the first file

{   # process the second file and do output
    lines2[$1] = $0
    count2[$1]++
    if ($1 != prev) flag = 1            # new key: print the matching file1 line once
    if (count1[$1]) {
        if (flag) printf "%s ", lines1[$1]
        else      printf "\t\t\t\t\t"   # pad repeated keys so the columns stay aligned
        flag = 0
        printf "\t%s\n", $0
    }
    prev = $1
}

END {   # output lines that are unique to one file or the other
    print "File 1: " file1
    for (i in lines1) if (! (i in lines2)) print lines1[i]
    print "File 2: " file2
    for (i in lines2) if (! (i in lines1)) print lines2[i]
}
To run it:
$ ./script.awk file1 file2
The lines won't be output in the same order that they appear in the input files. The second input file (file2) needs to be sorted since the script assumes that similar lines are adjacent. You will probably want to adjust the tabs or other spacing in the script. I haven't done much in that regard.
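For instance, assuming the script is saved as script.awk and made executable, the second file could be sorted on its first field beforehand (file2.sorted is just a placeholder name):
$ sort -k1,1 file2 > file2.sorted
$ ./script.awk file1 file2.sorted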
Another way to do it, reading both files up front in a BEGIN block:
BEGIN {
    FS = "\t"                    # tab-separated fields, as above
    readfile(ARGV[1], s1)
    readfile(ARGV[2], s2)
    ARGV[1] = ARGV[2] = "/dev/null"
}
END {
    for (k in s1) {
        if (s2[k]) printpair(k, s1, s2)
    }
    print "file1:"
    for (k in s1) {
        if (!s2[k]) print s1[k]
    }
    print "file2:"
    for (k in s2) {
        if (!s1[k]) print s2[k]
    }
}
# Read fname, grouping its lines by first field into the array sary;
# multiple lines with the same key are joined by newlines.
function readfile(fname, sary) {
    while ((getline < fname) > 0) {   # test > 0 so an open error can't loop forever
        key = $1
        if (sary[key]) {
            sary[key] = sary[key] "\n" $0
        } else {
            sary[key] = $0
        }
    }
    close(fname)
}
# Print the lines for a key that appears in both files, side by side.
function printpair(key, s1, s2) {
    n1 = split(s1[key], l1, "\n")
    n2 = split(s2[key], l2, "\n")
    for (i = 1; i <= max(n1, n2); i++) {
        if (i == 1) {
            b = l1[1]
            gsub(".", " ", b)    # blank padding as wide as the first file1 line
        }
        if (i <= n1) { f1 = l1[i] } else { f1 = b }
        if (i <= n2) { f2 = l2[i] } else { f2 = b }
        printf("%s\t%s\n", f1, f2)
    }
}
function max(x, y) { z = x; if (y > x) z = y; return z }
Not particularly elegant, but it handles many-to-many cases.
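A minimal usage sketch, assuming the script is saved as merge.awk (a hypothetical name):
$ awk -f merge.awk file1 file2
Both file names are consumed in the BEGIN block and then replaced with /dev/null, so the main input loop reads nothing and all of the work happens in BEGIN and END.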