Joining multiple fields in text files on Unix
How can I do it?
File1 looks like this:
foo 1 scaf 3
bar 2 scaf 3.3
File2 looks like this:
foo 1 scaf 4.5
foo 1 boo 2.3
bar 2 scaf 1.00
What I want to do is to find lines that co-occur in File1 and File2 when fields 1,2, and 3 are the same.
Is there a way to do it?
Here is the correct answer (in terms of using standard GNU coreutils tools, and not writing a custom script in perl/awk or the like).
$ join -j1 -o1.2,1.3,1.4,1.5,2.5 <(<file1 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1) <(<file2 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1)
bar 2 scaf 3.3 1.00
foo 1 scaf 3 4.5
OK, how does it work:
First of all we will use a great tool, join, which can merge two files line by line. join has two requirements (see the sketch just below):
- We can join only by a single field.
- Both files must be sorted by the key column!
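To get a quick feel for those two constraints, here is a minimal sketch with two hypothetical, already-sorted single-key files (a.sorted and b.sorted are made up purely for illustration):

$ cat a.sorted
bar 3.3
foo 3
$ cat b.sorted
bar 1.00
foo 4.5
$ join -j1 a.sorted b.sorted
bar 3.3 1.00
foo 3 4.5

join matches on field 1 of both files and, because both inputs are sorted on that field, it can do the merge in a single pass.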
We need to generate keys in the input files, and for that we use a simple awk script:

$ cat file1
foo 1 scaf 3
bar 2 scaf 3.3
$ <file1 awk '{print $1"-"$2"-"$3" "$0}'
foo-1-scaf foo 1 scaf 3
bar-2-scaf bar 2 scaf 3.3
You see, we added a 1st column with a key like "foo-1-scaf". We do the same with file2. BTW, <file awk '...' is just a fancy way of writing awk '...' file, or cat file | awk '...'.

We also need to sort our files by the key; in our case this is column 1, so we add | sort -k1,1 (sort by text from column 1 to column 1) to the end of each command.

At this point we could just generate files file1.with.key and file2.with.key and join them, but suppose those files are huge: we don't want to copy them over the filesystem. Instead we can use bash process substitution to feed the output through a named pipe (this avoids creating any unnecessary intermediate files). For more info please read the provided link.

Our target syntax is:
join <( some command ) <(some other command)
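As a sketch, the intermediate-file version and the process-substitution version below are equivalent (the file1.with.key and file2.with.key names are the ones mentioned above, used only for illustration):

# with intermediate files
<file1 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1 > file1.with.key
<file2 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1 > file2.with.key
join -j1 file1.with.key file2.with.key

# same result, no intermediate files
join -j1 <(<file1 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1) \
         <(<file2 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1)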
The last thing is to explain fancy join arguments:
-j1 -o1.2,1.3,1.4,1.5,2.5
-j1 - join by the key in the 1st column (in both files)
-o - output only those fields: 1.2 (1st file, field 2), 1.3 (1st file, field 3), etc.

This way we joined the lines, but join outputs only the necessary columns.
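To see what -o is selecting, here is a sketch of the default join output on the keyed, sorted inputs (file1.keyed and file2.keyed are hypothetical names for the process-substitution outputs shown above):

$ join -j1 file1.keyed file2.keyed
bar-2-scaf bar 2 scaf 3.3 bar 2 scaf 1.00
foo-1-scaf foo 1 scaf 3 foo 1 scaf 4.5

By default join prints the key, then the remaining fields of file 1, then the remaining fields of file 2; -o1.2,1.3,1.4,1.5,2.5 keeps only file 1's original columns plus file 2's value column, dropping the synthetic key and file 2's duplicated key columns.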
The lessons learned from this post should be:
- you should master the coreutils package; those tools are very powerful when combined, and you almost never need to write a custom program to deal with cases like this,
- the coreutils tools are also blazing fast and heavily tested, so they are usually the best choice.
The join command is hard to use and only joins on one column
Extensive experimentation plus close scrutiny of the manual pages indicates that you cannot directly join multiple columns - and all my working examples of join, funnily enough, use just one joining column.
Consequently, any solution will require the columns-to-be-joined to be concatenated into one column, somehow. The standard join command also requires its inputs to be in the correct sorted order - there's a remark in the GNU join (info coreutils join) about it not always requiring sorted data:
However, as a GNU extension, if the input has no unpairable lines the sort order can be any order that considers two fields to be equal if and only if the sort comparison described above considers them to be equal.
One possible way to do it with the given files is:
awk '{printf("%s:%s:%s %s %s %s %s\n", $1, $2, $3, $1, $2, $3, $4);}' file1 |
sort > sort1
awk '{printf("%s:%s:%s %s %s %s %s\n", $1, $2, $3, $1, $2, $3, $4);}' file2 |
sort > sort2
join -1 1 -2 1 -o 1.2,1.3,1.4,1.5,2.5 sort1 sort2
This creates a composite sort field at the start, using ':' to separate the sub-fields, and then sorts the file - for each of two files. The join command then joins on the two composite fields, but prints out only the non-composite (non-join) fields.
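For reference, with the sample files above, sort1 should end up looking like this (composite key in column 1, the original fields after it):

$ cat sort1
bar:2:scaf bar 2 scaf 3.3
foo:1:scaf foo 1 scaf 3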
The output is:
bar 2 scaf 3.3 1.00
foo 1 scaf 3 4.5
Failed attempts to make join do what it won't do
join -1 1 -2 1 -1 2 -2 2 -1 3 -2 3 -o 1.1,1.2,1.3,1.4,2.4 file1 file2
On MacOS X 10.6.3, this gives:
$ cat file1
foo 1 scaf 3
bar 2 scaf 3.3
$ cat file2
foo 1 scaf 4.5
foo 1 boo 2.3
bar 2 scaf 1.00
$ join -1 1 -2 1 -1 2 -2 2 -1 3 -2 3 -o 1.1,1.2,1.3,1.4,2.4 file1 file2
foo 1 scaf 3 4.5
bar 2 scaf 3.3 4.5
$
This is joining on field 3 (only) - which is not what is wanted.
You do need to ensure that the input files are in the correct sorted order.
It's probably easiest to combine the first three fields with awk:
awk '{print $1 "_" $2 "_" $3 " " $4}' filename
Then you can use join normally on "field 1".
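Putting that together, one possible pipeline is sketched below (it reuses the sample file names and the underscore key from above):

join -j1 \
    <(awk '{print $1 "_" $2 "_" $3 " " $4}' file1 | sort -k1,1) \
    <(awk '{print $1 "_" $2 "_" $3 " " $4}' file2 | sort -k1,1)

With the sample data this should print lines like bar_2_scaf 3.3 1.00 and foo_1_scaf 3 4.5; add -o to drop or reorder the key columns as needed.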
you can try this
awk '{
# remember the first three fields as the key
o1=$1;o2=$2;o3=$3
# blank the key fields and strip the leftover spaces, leaving only the value
$1=$2=$3="";gsub(" +","")
# append the value to whatever has already been collected under this key
_[o1 FS o2 FS o3]=_[o1 FS o2 FS o3] FS $0
}
END{ for(i in _) print i,_[i] }' file1 file2
output
$ ./shell.sh
foo 1 scaf 3 4.5
bar 2 scaf 3.3 1.00
foo 1 boo 2.3
If you want to omit uncommon lines
awk 'FNR==NR{
# first file (file2): store fields 4..NF under the 3-field key
s=""
for(i=4;i<=NF;i++){ s=s FS $i }
_[$1 FS $2 FS $3] = s
next
}
# second file (file1): print only the lines whose key was also seen in file2
($1 FS $2 FS $3) in _ {
printf "%s %s %s ", $1, $2, $3
for(o=4;o<NF;o++){
printf "%s ", $o
}
printf "%s%s\n", $NF, _[$1 FS $2 FS $3]
} ' file2 file1
output
$ ./shell.sh
foo 1 scaf 3 4.5
bar 2 scaf 3.3 1.00
How about:
cat file1 file2 |
    awk '{print $1" "$2" "$3}' |
    sort |
    uniq -c |
    grep -v '^ *1 ' |
    awk '{print $2" "$3" "$4}'
This is assuming you're not too worried about the white space between fields (in other words, three tabs and a space is no different to a space and 7 tabs). This is usually the case when you're talking about fields within a text file.
What it does is output both files, stripping off the last field (since you don't care about that one in terms of comparisons). It then sorts that so that similar lines are adjacent, then uniquifies them (replaces each group of adjacent identical lines with one copy and a count).
It then gets rid of all those that had a one-count (no duplicates) and prints out each with the count stripped off. That gives you your "keys" to the duplicate lines, and you can then use another awk iteration to locate those keys in the files if you wish (a sketch of that second pass follows).
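As a sketch of that second pass (keys.txt is just an illustrative name):

# collect the keys that appear more than once across both files
cat file1 file2 | awk '{print $1" "$2" "$3}' | sort | uniq -c |
    grep -v '^ *1 ' | awk '{print $2" "$3" "$4}' > keys.txt

# print every line from either file whose first three fields match one of those keys
awk 'NR==FNR { keys[$1" "$2" "$3]; next }
     ($1" "$2" "$3) in keys' keys.txt file1 file2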
This won't work as expected if two identical keys are only in one file, since the files are combined early on. In other words, if you have duplicate keys in file1 but not in file2, that will be a false positive.
Then, the only real solution I can think of is one which checks file2 for each line in file1, although I'm sure others may come up with cleverer solutions.
And, for those who enjoy a little bit of sado-masochism, here's the afore-mentioned not-overly-efficient solution:
cat file1 |
    sed -e 's/ [^ ]*$/ "/' \
        -e 's/ / */g' \
        -e 's/^/grep "^/' \
        -e 's/$/ file2 | awk "{print \\$1\\" \\"\\$2\\" \\"\\$3}"/' \
    > xx99
bash xx99
rm xx99
This one constructs a separate script file to do the work. For each line in file1, it creates a line in the script to look for that line in file2. If you want to see how it works, just have a look at xx99 before you delete it.
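With the sample file1, xx99 contains lines of roughly this shape:

grep "^foo *1 *scaf *" file2 | awk "{print \$1\" \"\$2\" \"\$3}"
grep "^bar *2 *scaf *" file2 | awk "{print \$1\" \"\$2\" \"\$3}"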
And, in this one, the spaces do matter, so don't be surprised if it doesn't work for lines where spaces are different between file1 and file2 (though, as with most "hideous" scripts, that can be fixed with just another link in the pipeline). It's more here as an example of the ghastly things you can create for quick'n'dirty jobs.
This is not what I would do for production-quality code but it's fine for a once-off, provided you destroy all evidence of it before The Daily WTF finds out about it :-)
Here is a way to do it in Perl:
#!/usr/local/bin/perl
use warnings;
use strict;
open my $file1, "<", "file1" or die $!;
my %file1keys;
while (<$file1>) {
my @keys = split /\s+/, $_;
next unless @keys;
$file1keys{$keys[0]}{$keys[1]}{$keys[2]} = [$., $_];
}
close $file1 or die $!;
open my $file2, "<", "file2" or die $!;
while (<$file2>) {
my @keys = split /\s+/, $_;
next unless @keys;
if (my $found = $file1keys{$keys[0]}{$keys[1]}{$keys[2]}) {
print "Keys occur at file1:$found->[0] and file2:$..\n";
}
}
close $file2 or die $!;
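Saved as, say, common_keys.pl (an arbitrary name for illustration) and run against the sample files, it should report something like:

$ perl common_keys.pl
Keys occur at file1:1 and file2:1.
Keys occur at file1:2 and file2:3.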
Simple method (no awk, join, sed, or perl), using the software tools cut, grep, and sort:
cut -d ' ' -f1-3 File1 | grep -h -f - File1 File2 | sort -t ' ' -k 1,2g
Output (does not print unmatched lines):
bar 2 scaf 1.00
bar 2 scaf 3.3
foo 1 scaf 3
foo 1 scaf 4.5
How it works...
cut makes a list of all the lines to search for. grep's -f - switch inputs the lines from cut and searches File1 and File2 for them. sort isn't necessary, but makes the data easier to read.
Condensed results with datamash:
cut -d ' ' -f1-3 File1 | grep -h -f - File1 File2 | \
datamash -t ' ' -s -g1,2,3 collapse 4
Output:
bar 2 scaf 3.3,1.00
foo 1 scaf 3,4.5
If File1 is huge and is somewhat redundant, adding sort -u should speed things up:
cut -d ' ' -f1-3 File1 | sort -u | grep -h -f - File1 File2 | sort -t ' ' -k 1,2g
A professor I used to work with created a set of perl scripts that can perform a lot of database-like operations on column-oriented flat text files. It's called Fsdb. It can definitely do this, and it's especially worth looking into if this isn't just a one-off need (so you're not constantly writing custom scripts).
A solution similar to the one Jonathan Leffler offered.
Create 2 temporary sorted files with a different delimiter, each with the matching columns combined into the first field. Then join the temp files on the first field, and output the second field.
$ cat file1.txt |awk -F" " '{print $1"-"$2"-"$3";"$0}' |sort >file1.tmp
$ cat file2.txt |awk -F" " '{print $1"-"$2"-"$3";"$0}' |sort >file2.tmp
$ join -t';' -o 1.2 file1.tmp file2.tmp >file1.same.txt
$ join -t';' -o 2.2 file1.tmp file2.tmp >file2.same.txt
$ rm -f file1.tmp file2.tmp
$ cat file1.same.txt
bar 2 scaf 3.3
foo 1 scaf 3
$ cat file2.same.txt
bar 2 scaf 1.00
foo 1 scaf 4.5
Using datamash's collapse operation, plus a bit of cosmetic sort-ing and tr-ing:
cat File* | datamash -t ' ' -s -g1,2,3 collapse 4 |
sort -g -k2 | tr ',' ' '
Output (common lines have a 5th field, uncommon lines do not):
foo 1 boo 2.3
foo 1 scaf 3 4.5
bar 2 scaf 3.3 1.00
The OP doesn't show the expected output, so I don't know if this is exactly the desired output, but this is the way to approach the problem:
$ awk '
{ key=$1 FS $2 FS $3 }
NR==FNR { val[key]=$4; next }
key in val {print $0, val[key] }
' file1 file2
foo 1 scaf 4.5 3
bar 2 scaf 1.00 3.3