开发者

Awk: Remove duplicate lines with conditions

I have a tab-delimited text file with 8 columns:

Erythropoietin Receptor Integrin Beta 4 11.7    9.7 164 195 19  3.2
Erythropoietin Receptor Receptor Tyrosine Phosphatase F 10.8    2.6 97  107 15  3.2
Erythropoietin Receptor Leukemia Inhibitory Factor Receptor 12.0    3.6 171 479 14  3.2
Erythropoietin Receptor Immunoglobulin 9    10.4    3.1 100 108 24  3.3
Erythropoietin Receptor Collagen Alpha 1 Xx 10.7    2.7 93  105 18  3.3
Tumor Necrosis Factor Receptor  Tumor Necrosis Factor Receptor 5    11.4    3.2 114 114 25  1.7
Tumor Necrosis Factor Receptor  Tumor Necrosis Factor Receptor 14   11.1    2.1 99  100 28  1.8
Tumor Necrosis Factor Receptor  Tumor Necrosis Factor Receptor 1B   10.9    4.9 133 162 29  1.9
Tumor Necrosis Factor Receptor  Tumor Necrosis Factor Receptor 11A  11.5    5.1 130 166 25  1.9

The first and second column contain protein names and the 8th column contains the "distance" score between each protein pair. I would like to remove the lines containing duplicate protein pairs and keep only the pair with the lowest distance (the lowest value in the 8th column). This means that for the pair Protein A-Protein B I would like to remove all occurrences except the one with the lowest distance score. The pair is considered duplicate even if the protein names are swapped (in different columns). This means that Protein A Protein B is th开发者_StackOverflow中文版e same as Protein B Protein A.


Something like this (untested):

awk -F'\t' 'END {
  for (r in rec) print rec[r] 
  }
{
  if (mina[$1, $2] < $NF || minb[$2, $1] < $NF) {
    mina[$1, $2] = $NF; minb[$2, $1] = $NF
    rec[$1, $2] = $0
    }  
  }' infile


I hope this would be the final update ^_^

kent$  awk -F'\t' '{if($1$2 in a){
                if($8<a[$1$2]){
                        a[$1$2]=$8;r[$1$2]=$0;
                }
        }else if ($2$1 in a){
                if($8<a[$2$1]){
                        a[$2$1] = $8;r[$2$1] = $0;
                }
        }else{
                a[$1$2]=$8; r[$1$2]=$0;
        }
} END{for(x in r)print r[x]}' yourFile
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜