Joining multiple fields in text files on Unix
How can I do it?
File1 looks like this:
foo 1 scaf 3
bar 2 scaf 3.3
File2 looks like this:
foo 1 scaf 4.5
foo 1 boo 2.3
bar 2 scaf 1.00
What I want to do is to find lines that co-occur in File1 and File2 when fields 1,2, and 3 are the same.
Is there a way to do it?
Here is the correct answer (in terms of using standard GNU coreutils tools, and not writing a custom script in perl/awk or the like).
$ join -j1 -o1.2,1.3,1.4,1.5,2.5 <(<file1 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1) <(<file2 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1)
bar 2 scaf 3.3 1.00
foo 1 scaf 3 4.5
OK, how does it work:
First of all we will use a great tool, join, which can merge two files line by line. join has two requirements (see the sketch just below):
- We can join only by a single field.
- Both files must be sorted by the key column!
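To get a quick feel for those two constraints, here is a minimal sketch with two hypothetical, already-sorted single-key files (a.sorted and b.sorted are made up purely for illustration):

$ cat a.sorted
bar 3.3
foo 3
$ cat b.sorted
bar 1.00
foo 4.5
$ join -j1 a.sorted b.sorted
bar 3.3 1.00
foo 3 4.5

join matches on field 1 of both files and, because both inputs are sorted on that field, it can do the merge in a single pass.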
We need to generate keys in the input files, and for that we use a simple awk script:

$ cat file1
foo 1 scaf 3
bar 2 scaf 3.3
$ <file1 awk '{print $1"-"$2"-"$3" "$0}'
foo-1-scaf foo 1 scaf 3
bar-2-scaf bar 2 scaf 3.3
You see, we added a 1st column with a key like "foo-1-scaf". We do the same with file2. BTW, <file awk '...' is just a fancy way of writing awk '...' file, or cat file | awk '...'.

We also need to sort our files by the key; in our case this is column 1, so we add | sort -k1,1 (sort by text from column 1 to column 1) to the end of each command.

At this point we could just generate files file1.with.key and file2.with.key and join them, but suppose those files are huge: we don't want to copy them over the filesystem. Instead we can use bash process substitution to feed the output through a named pipe (this avoids creating any unnecessary intermediate files). For more info please read the provided link.

Our target syntax is:
join <( some command ) <(some other command)
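As a sketch, the intermediate-file version and the process-substitution version below are equivalent (the file1.with.key and file2.with.key names are the ones mentioned above, used only for illustration):

# with intermediate files
<file1 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1 > file1.with.key
<file2 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1 > file2.with.key
join -j1 file1.with.key file2.with.key

# same result, no intermediate files
join -j1 <(<file1 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1) \
         <(<file2 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1)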
The last thing is to explain fancy join arguments:
-j1 -o1.2,1.3,1.4,1.5,2.5
-j1 - join by the key in the 1st column (in both files)
-o - output only those fields: 1.2 (1st file, field 2), 1.3 (1st file, field 3), etc.

This way we joined the lines, but join outputs only the necessary columns.
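To see what -o is selecting, here is a sketch of the default join output on the keyed, sorted inputs (file1.keyed and file2.keyed are hypothetical names for the process-substitution outputs shown above):

$ join -j1 file1.keyed file2.keyed
bar-2-scaf bar 2 scaf 3.3 bar 2 scaf 1.00
foo-1-scaf foo 1 scaf 3 foo 1 scaf 4.5

By default join prints the key, then the remaining fields of file 1, then the remaining fields of file 2; -o1.2,1.3,1.4,1.5,2.5 keeps only file 1's original columns plus file 2's value column, dropping the synthetic key and file 2's duplicated key columns.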
The lessons learned from this post should be:
- you should master the coreutils package; those tools are very powerful when combined, and you almost never need to write a custom program to deal with cases like this,
- the coreutils tools are also blazing fast and heavily tested, so they are usually the best choice.
The join command is hard to use and only joins on one column
Extensive experimentation plus close scrutiny of the manual pages indicates that you cannot directly join multiple columns - and all my working examples of join, funnily enough, use just one joining column.
Consequently, any solution will require the columns-to-be-joined to be concatenated into one column, somehow. The standard join command also requires its inputs to be in the correct sorted order - there's a remark in the GNU join (info coreutils join) about it not always requiring sorted data:
However, as a GNU extension, if the input has no unpairable lines the sort order can be any order that considers two fields to be equal if and only if the sort comparison described above considers them to be equal.
One possible way to do it with the given files is:
awk '{printf("%s:%s:%s %s %s %s %s\n", $1, $2, $3, $1, $2, $3, $4);}' file1 |
sort > sort1
awk '{printf("%s:%s:%s %s %s %s %s\n", $1, $2, $3, $1, $2, $3, $4);}' file2 |
sort > sort2
join -1 1 -2 1 -o 1.2,1.3,1.4,1.5,2.5 sort1 sort2
This creates a composite sort field at the start, using ':' to separate the sub-fields, and then sorts the file - for each of two files. The join command then joins on the two composite fields, but prints out only the non-composite (non-join) fields.
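For reference, with the sample files above, sort1 should end up looking like this (composite key in column 1, the original fields after it):

$ cat sort1
bar:2:scaf bar 2 scaf 3.3
foo:1:scaf foo 1 scaf 3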
The output is:
bar 2 scaf 3.3 1.00
foo 1 scaf 3 4.5
Failed attempts to make join do what it won't do
join -1 1 -2 1 -1 2 -2 2 -1 3 -2 3 -o 1.1,1.2,1.3,1.4,2.4 file1 file2
On MacOS X 10.6.3, this gives:
$ cat file1
foo 1 scaf 3
bar 2 scaf 3.3
$ cat file2
foo 1 scaf 4.5
foo 1 boo 2.3
bar 2 scaf 1.00
$ join -1 1 -2 1 -1 2 -2 2 -1 3 -2 3 -o 1.1,1.2,1.3,1.4,2.4 file1 file2
foo 1 scaf 3 4.5
bar 2 scaf 3.3 4.5
$
This is joining on field 3 (only) - which is not what is wanted.
You do need to ensure that the input files are in the correct sorted order.
It's probably easiest to combine the first three fields with awk:
awk '{print $1 "_" $2 "_" $3 " " $4}' filename
Then you can use join normally on "field 1".
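Putting that together, one possible pipeline is sketched below (it reuses the sample file names and the underscore key from above):

join -j1 \
    <(awk '{print $1 "_" $2 "_" $3 " " $4}' file1 | sort -k1,1) \
    <(awk '{print $1 "_" $2 "_" $3 " " $4}' file2 | sort -k1,1)

With the sample data this should print lines like bar_2_scaf 3.3 1.00 and foo_1_scaf 3 4.5; add -o to drop or reorder the key columns as needed.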
you can try this
awk '{
# remember the first three fields as the key
o1=$1;o2=$2;o3=$3
# blank the key fields and strip the leftover spaces, leaving only the value
$1=$2=$3="";gsub(" +","")
# append the value to whatever has already been collected under this key
_[o1 FS o2 FS o3]=_[o1 FS o2 FS o3] FS $0
}
END{ for(i in _) print i,_[i] }' file1 file2
output
$ ./shell.sh
foo 1 scaf 3 4.5
bar 2 scaf 3.3 1.00
foo 1 boo 2.3
If you want to omit uncommon lines
awk 'FNR==NR{
# first file (file2): store fields 4..NF under the 3-field key
s=""
for(i=4;i<=NF;i++){ s=s FS $i }
_[$1 FS $2 FS $3] = s
next
}
# second file (file1): print only the lines whose key was also seen in file2
($1 FS $2 FS $3) in _ {
printf "%s %s %s ", $1, $2, $3
for(o=4;o<NF;o++){
printf "%s ", $o
}
printf "%s%s\n", $NF, _[$1 FS $2 FS $3]
} ' file2 file1
output
$ ./shell.sh
foo 1 scaf 3 4.5
bar 2 scaf 3.3 1.00
How about:
cat file1 file2 |
    awk '{print $1" "$2" "$3}' |
    sort |
    uniq -c |
    grep -v '^ *1 ' |
    awk '{print $2" "$3" "$4}'
This is assuming you're not too worried about the white space between fields (in other words, three tabs and a space is no different to a space and 7 tabs). This is usually the case when you're talking about fields within a text file.
What it does is output both files, stripping off the last field (since you don't care about that one in terms of comparisons). It then sorts that so that similar lines are adjacent, then uniquifies them (replaces each group of adjacent identical lines with one copy and a count).
It then gets rid of all those that had a one-count (no duplicates) and prints out each with the count stripped off. That gives you your "keys" to the duplicate lines, and you can then use another awk iteration to locate those keys in the files if you wish (a sketch of that second pass follows).
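As a sketch of that second pass (keys.txt is just an illustrative name):

# collect the keys that appear more than once across both files
cat file1 file2 | awk '{print $1" "$2" "$3}' | sort | uniq -c |
    grep -v '^ *1 ' | awk '{print $2" "$3" "$4}' > keys.txt

# print every line from either file whose first three fields match one of those keys
awk 'NR==FNR { keys[$1" "$2" "$3]; next }
     ($1" "$2" "$3) in keys' keys.txt file1 file2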
This won't work as expected if two identical keys are only in one file, since the files are combined early on. In other words, if you have duplicate keys in file1 but not in file2, that will be a false positive.
Then, the only real solution I can think of is one which checks file2 for each line in file1, although I'm sure others may come up with cleverer solutions.
And, for those who enjoy a little bit of sado-masochism, here's the afore-mentioned not-overly-efficient solution:
cat file1 |
    sed -e 's/ [^ ]*$/ "/' \
        -e 's/ / */g' \
        -e 's/^/grep "^/' \
        -e 's/$/ file2 | awk "{print \\$1\\" \\"\\$2\\" \\"\\$3}"/' \
    > xx99
bash xx99
rm xx99
This one constructs a separate script file to do the work. For each line in file1, it creates a line in the script to look for that line in file2. If you want to see how it works, just have a look at xx99 before you delete it.
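With the sample file1, xx99 contains lines of roughly this shape:

grep "^foo *1 *scaf *" file2 | awk "{print \$1\" \"\$2\" \"\$3}"
grep "^bar *2 *scaf *" file2 | awk "{print \$1\" \"\$2\" \"\$3}"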
And, in this one, the spaces do matter, so don't be surprised if it doesn't work for lines where spaces are different between file1 and file2 (though, as with most "hideous" scripts, that can be fixed with just another link in the pipeline). It's more here as an example of the ghastly things you can create for quick'n'dirty jobs.
This is not what I would do for production-quality code but it's fine for a once-off, provided you destroy all evidence of it before The Daily WTF finds out about it :-)
Here is a way to do it in Perl:
#!/usr/local/bin/perl
use warnings;
use strict;
open my $file1, "<", "file1" or die $!;
my %file1keys;
while (<$file1>) {
my @keys = split /\s+/, $_;
next unless @keys;
$file1keys{$keys[0]}{$keys[1]}{$keys[2]} = [$., $_];
}
close $file1 or die $!;
open my $file2, "<", "file2" or die $!;
while (<$file2>) {
my @keys = split /\s+/, $_;
next unless @keys;
if (my $found = $file1keys{$keys[0]}{$keys[1]}{$keys[2]}) {
print "Keys occur at file1:$found->[0] and file2:$..\n";
}
}
close $file2 or die $!;
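Saved as, say, common_keys.pl (an arbitrary name for illustration) and run against the sample files, it should report something like:

$ perl common_keys.pl
Keys occur at file1:1 and file2:1.
Keys occur at file1:2 and file2:3.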
Simple method (no awk, join, sed, or perl), using the software tools cut, grep, and sort:
cut -d ' ' -f1-3 File1 | grep -h -f - File1 File2 | sort -t ' ' -k 1,2g
Output (does not print unmatched lines):
bar 2 scaf 1.00
bar 2 scaf 3.3
foo 1 scaf 3
foo 1 scaf 4.5
How it works...
cut makes a list of all the lines to search for. grep's -f - switch inputs the lines from cut and searches File1 and File2 for them. sort isn't necessary, but makes the data easier to read.
Condensed results with datamash:
cut -d ' ' -f1-3 File1 | grep -h -f - File1 File2 | \
datamash -t ' ' -s -g1,2,3 collapse 4
Output:
bar 2 scaf 3.3,1.00
foo 1 scaf 3,4.5
If File1 is huge and is somewhat redundant, adding sort -u should speed things up:
cut -d ' ' -f1-3 File1 | sort -u | grep -h -f - File1 File2 | sort -t ' ' -k 1,2g
A professor I used to work with created a set of perl scripts that can perform a lot of database-like operations on column-oriented flat text files. It's called Fsdb. It can definitely do this, and it's especially worth looking into if this isn't just a one-off need (so you're not constantly writing custom scripts).
A solution similar to the one Jonathan Leffler offered.
Create 2 temporary sorted files with a different delimiter, each with the matching columns combined into the first field. Then join the temp files on the first field, and output the second field.
$ cat file1.txt |awk -F" " '{print $1"-"$2"-"$3";"$0}' |sort >file1.tmp
$ cat file2.txt |awk -F" " '{print $1"-"$2"-"$3";"$0}' |sort >file2.tmp
$ join -t';' -o 1.2 file1.tmp file2.tmp >file1.same.txt
$ join -t';' -o 2.2 file1.tmp file2.tmp >file2.same.txt
$ rm -f file1.tmp file2.tmp
$ cat file1.same.txt
bar 2 scaf 3.3
foo 1 scaf 3
$ cat file2.same.txt
bar 2 scaf 1.00
foo 1 scaf 4.5
Using datamash's collapse operation, plus a bit of cosmetic sort-ing and tr-ing:
cat File* | datamash -t ' ' -s -g1,2,3 collapse 4 |
sort -g -k2 | tr ',' ' '
Output (common lines have a 5th field, uncommon lines do not):
foo 1 boo 2.3
foo 1 scaf 3 4.5
bar 2 scaf 3.3 1.00
The OP doesn't show the expected output, so I don't know if this is exactly the desired output, but this is the way to approach the problem:
$ awk '
{ key=$1 FS $2 FS $3 }
NR==FNR { val[key]=$4; next }
key in val {print $0, val[key] }
' file1 file2
foo 1 scaf 4.5 3
bar 2 scaf 1.00 3.3