How to merge two files using AWK? [duplicate]
File 1 has 5 fields, A B C D E, where field A is integer-valued.
File 2 has 3 fields, A F G.
The number of rows in File 1 is much larger than in File 2 (20^6 to 5000).
All the entries of A in File 1 appear in field A of File 2.
I would like to merge the two files on field A and carry over F and G.
The desired output is A B C D E F G.
Example
File 1
A B C D E
4050 S00001 31228 3286 0
4050 S00012 31227 4251 0
4049 S00001 28342 3021 1
4048 S00001 46578 4210 0
4048 S00113 31221 4250 0
4047 S00122 31225 4249 0
4046 S00344 31322 4000 1
File 2
A F G
4050 12.1 23.6
4049 14.4 47.8
4048 23.2 43.9
4047 45.5 21.6
Desired output
A B C D E F G
4050 S00001 31228 3286 0 12.1 23.6
4050 S00012 31227 4251 0 12.1 23.6
4049 S00001 28342 3021 1 14.4 47.8
4048 S00001 46578 4210 0 23.2 43.9
4048 S00113 31221 4250 0 23.2 43.9
4047 S00122 31225 4249 0 45.5 21.6
$ awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' file2 file1
4050 S00001 31228 3286 0 12.1 23.6
4050 S00012 31227 4251 0 12.1 23.6
4049 S00001 28342 3021 1 14.4 47.8
4048 S00001 46578 4210 0 23.2 43.9
4048 S00113 31221 4250 0 23.2 43.9
4047 S00122 31225 4249 0 45.5 21.6
4046 S00344 31322 4000 1
Explanation (partly based on another question; a bit late, though):
FNR refers to the record number (typically the line number) in the current file, and NR refers to the total record number across all input files. The operator == is a comparison operator, which returns true when its two operands are equal. So FNR==NR{commands} means that the commands inside the braces are executed only while processing the first file (file2 here).
FS refers to the field separator, and $1, $2, etc. are the 1st, 2nd, etc. fields in a line. a[$1]=$2 FS $3 means that an associative array (dictionary) named a is filled with $1 as the key and $2 FS $3 as the value.
; separates the commands.
next means that any remaining commands are skipped for the current line. (Processing continues with the next line.)
$0 is the whole line.
{print $0, a[$1]} simply prints the whole line followed by the value of a[$1]. If $1 is not in the array, a[$1] is empty, so only $0 and the output separator are printed. This block is executed only for the second file (file1 here), because of FNR==NR{...;next}.
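The explanation above can be tried end to end with a small variant. This is only a sketch, assuming you would rather drop rows of file1 whose key is missing from file2 (the 4046 row in the sample) than print them without F and G. The scratch directory and the variable name merged are illustrative, not part of the original answer.

```shell
#!/bin/sh
# Illustrative setup: recreate the sample data in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
4050 S00001 31228 3286 0
4050 S00012 31227 4251 0
4049 S00001 28342 3021 1
4048 S00001 46578 4210 0
4048 S00113 31221 4250 0
4047 S00122 31225 4249 0
4046 S00344 31322 4000 1
EOF
cat > "$tmp/file2" <<'EOF'
4050 12.1 23.6
4049 14.4 47.8
4048 23.2 43.9
4047 45.5 21.6
EOF

# First file (file2): remember "F G" under key A, then skip to the next line.
# Second file (file1): print a row only if its key was seen in file2.
merged=$(awk 'FNR==NR { a[$1] = $2 FS $3; next }
              ($1 in a) { print $0, a[$1] }' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$merged"
rm -rf "$tmp"
```

The ($1 in a) test checks for the key without creating an empty entry, so unmatched rows are skipped instead of being printed with a trailing separator.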
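Thankfully, you don't need to write this at all. Unix has a join command to do this for you. Note that join expects both inputs to be sorted on the join field, and by default it omits lines that have no match in the other file.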
join -1 1 -2 1 File1 File2
Here it is "in action":
will-hartungs-computer:tmp will$ cat f1
4050 S00001 31228 3286 0
4050 S00012 31227 4251 0
4049 S00001 28342 3021 1
4048 S00001 46578 4210 0
4048 S00113 31221 4250 0
4047 S00122 31225 4249 0
4046 S00344 31322 4000 1
will-hartungs-computer:tmp will$ cat f2
4050 12.1 23.6
4049 14.4 47.8
4048 23.2 43.9
4047 45.5 21.6
will-hartungs-computer:tmp will$ join -1 1 -2 1 f1 f2
4050 S00001 31228 3286 0 12.1 23.6
4050 S00012 31227 4251 0 12.1 23.6
4049 S00001 28342 3021 1 14.4 47.8
4048 S00001 46578 4210 0 23.2 43.9
4048 S00113 31221 4250 0 23.2 43.9
4047 S00122 31225 4249 0 45.5 21.6
will-hartungs-computer:tmp will$
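Since join expects its inputs to be sorted on the join field, it may be safer to sort both files first. The following is a minimal sketch under that assumption. The file names f1.sorted and f2.sorted and the variable joined are made up for illustration, and LC_ALL=C is used so that sort and join agree on the ordering.

```shell
#!/bin/sh
# Illustrative setup: recreate the sample data in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/f1" <<'EOF'
4050 S00001 31228 3286 0
4050 S00012 31227 4251 0
4049 S00001 28342 3021 1
4048 S00001 46578 4210 0
4048 S00113 31221 4250 0
4047 S00122 31225 4249 0
4046 S00344 31322 4000 1
EOF
cat > "$tmp/f2" <<'EOF'
4050 12.1 23.6
4049 14.4 47.8
4048 23.2 43.9
4047 45.5 21.6
EOF

# Sort both inputs on field 1 using the same collation that join will use.
LC_ALL=C sort -k1,1 "$tmp/f1" > "$tmp/f1.sorted"
LC_ALL=C sort -k1,1 "$tmp/f2" > "$tmp/f2.sorted"

# Join on the first field of each file (the default join field).
joined=$(LC_ALL=C join "$tmp/f1.sorted" "$tmp/f2.sorted")
printf '%s\n' "$joined"
rm -rf "$tmp"
```

The output comes back in ascending key order rather than the original order of File 1, and the unmatched 4046 row is dropped. If the original row order matters, the awk approach may be the better fit.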
You need to read the entries from File 2 into a pair of associative arrays in the BEGIN block. Assuming GNU Awk:
BEGIN { while ((getline < "File 2") > 0) { f[$1] = $2; g[$1] = $3 } }
In the main processing block, you read the line from File 1 and print it with the correct data from the arrays created in the BEGIN block:
{ print $0, f[$1], g[$1] }
Supply File 1 as the filename argument to the program.
awk 'BEGIN { while ((getline < "File 2") > 0) { f[$1] = $2; g[$1] = $3 } }
{ print $0, f[$1], g[$1] }' "File 1"
The quotes around the file name argument are needed because of the spaces in the file name. You need the quotes around the getline filename even if it contained no spaces, as it would otherwise be treated as a variable name.
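For reference, here is a self-contained sketch of the BEGIN/getline approach that can be run as is. It assumes the lookup file name is passed in through an awk variable (called lookup here, a name chosen only for illustration) and compares the getline result with 0 so the loop stops on end of file or on a read error.

```shell
#!/bin/sh
# Illustrative setup: sample data in a scratch directory, with spaces in the names.
tmp=$(mktemp -d)
cat > "$tmp/File 1" <<'EOF'
4050 S00001 31228 3286 0
4050 S00012 31227 4251 0
4049 S00001 28342 3021 1
4048 S00001 46578 4210 0
4048 S00113 31221 4250 0
4047 S00122 31225 4249 0
4046 S00344 31322 4000 1
EOF
cat > "$tmp/File 2" <<'EOF'
4050 12.1 23.6
4049 14.4 47.8
4048 23.2 43.9
4047 45.5 21.6
EOF

# BEGIN: load F and G from the lookup file into two arrays keyed by A.
# getline returns 1 per record read, 0 at end of file, and -1 on error.
# Main block: append the stored F and G to every line of File 1.
out=$(awk -v lookup="$tmp/File 2" '
BEGIN {
    while ((getline < lookup) > 0) { f[$1] = $2; g[$1] = $3 }
    close(lookup)
}
{ print $0, f[$1], g[$1] }' "$tmp/File 1")
printf '%s\n' "$out"
rm -rf "$tmp"
```

Rows whose key is missing from the lookup file (4046 in the sample) are still printed, just with empty F and G values.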
awk 'BEGIN{OFS=","} FNR==NR {F[$1]=$2;G[$1]=$3;next} {print $1,$2,$3,$4,$5,F[$1],G[$1]}' file2.txt file1.txt