
Check if a string exists in non-consecutive lines in a given column

I have files with the following format:

ATOM   8962  CA  VAL W   8       8.647  81.467  25.656  1.00115.78           C  
ATOM   8963  C   VAL W   8      10.053  80.963  25.506  1.00114.60           C  
ATOM   8964  O   VAL W   8      10.636  80.422  26.442  1.00114.53           O  
ATOM   8965  CB  VAL W   8       7.643  80.389  25.325  1.00115.67           C  
ATOM   8966  CG1 VAL W   8       6.476  80.508  26.249  1.00115.54           C  
ATOM   8967  CG2 VAL W   8       7.174  80.526  23.886  1.00115.26           C  
ATOM   4440  O   TYR S  89       4.530 166.005 -14.543  1.00 95.76           O  
ATOM   4441  CB  TYR S  89       2.847 168.812 -13.864  1.00 96.31           C  
ATOM   4442  CG  TYR S  89       3.887 169.413 -14.756  1.00 98.43           C  
ATOM   4443  CD1 TYR S  89       3.515 170.073 -15.932  1.00100.05           C  
ATOM   4444  CD2 TYR S  89       5.251 169.308 -14.451  1.00100.50           C  
ATOM   4445  CE1 TYR S  89       4.464 170.642 -16.779  1.00100.70           C  
ATOM   4446  CE2 TYR S  89       6.219 169.868 -15.298  1.00101.40           C  
ATOM   4447  CZ  TYR S  89       5.811 170.535 -16.464  1.00100.46           C  
ATOM   4448  OH  TYR S  89       6.736 171.094 -17.321  1.00100.20           O  
ATOM   4449  N   LEU S  90       3.944 166.393 -12.414  1.00 94.95           N  
ATOM   4450  CA  LEU S  90       5.079 165.622 -11.914  1.00 94.44           C  
ATOM   5151  N   LEU W   8     -66.068 209.785 -11.037  1.00117.44           N  
ATOM   5152  CA  LEU W   8     -64.800 210.035 -10.384  1.00116.52           C  
ATOM   5153  C   LEU W   8     -64.177 208.641 -10.198  1.00116.71           C  
ATOM   5154  O   LEU W   8     -64.513 207.944  -9.241  1.00116.99           O  
ATOM   5155  CB  LEU W   8     -65.086 210.682  -9.033  1.00115.76           C  
ATOM   5156  CG  LEU W   8     -64.274 211.829  -8.478  1.00113.89           C  
ATOM   5157  CD1 LEU W   8     -64.528 211.857  -7.006  1.00111.94           C  
ATOM   5158  CD2 LEU W   8     -62.828 211.612  -8.739  1.00112.96           C  

In principle, column 5 (W in this case, which represents the chain ID) should be identical only within consecutive chunks. However, in files with too many chains there are not enough letters in the alphabet to assign a unique ID to each chain, so duplicates may occur.

I would like to check whether this is the case. In other words, I would like to know whether a given chain ID (A-Z, always in the 5th column) is present in non-consecutive chunks. I do not mind if it changes from W to S; I want to know whether two separate chunks share the same chain ID, that is, in this case, whether W or S reappears at some point. In fact, this is only a problem if they also share the first and the 6th columns, but I do not want to complicate things too much.

I do not want to print the lines, just the name of the file in which the issue occurs and the chain ID (in this case W), so that I can fix it. I already know how to solve the problem, but I need to identify the problematic files so that I can focus on those rather than repairing files that are already fine.

SOLUTION (thanks to all for your help, and especially to sehe):

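# Glob directly instead of parsing ls, so unusual file names survive.
for pdb in *.pdb; do
    # Keep ATOM records, take the chain ID column, collapse runs, count repeats.
    hit=$(awk '$1 == "ATOM"' "$pdb" | cut -c22-23 | uniq | sort | uniq -dc)
    # $hit is left unquoted on purpose so the padding from uniq -c collapses.
    [ "$hit" ] && echo "$pdb" = $hit
done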


For this particular sample:

cut -c22-23 t | uniq | sort | uniq -dc

Will output

2 W

(character column 22, the chain ID, contains 2 separate runs of the letter 'W')


untested

awk '
    seen[$5] && $5 != current {
        print "found non-consecutive chain on line " NR
        exit
    }
    { current = $5; seen[$5] = 1 }
' filename
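
A variant of the sketch above that prints the file name and chain ID, as the question asks, might look like the following. It is only a sketch under two assumptions: the chain ID sits in character column 22 of each ATOM record (as in the sample), and the function name find_repeated_chains is made up for illustration. Reading the fixed column instead of $5 avoids miscounting when neighbouring fields run together.

```shell
# Hypothetical helper: for each file, print "file: chain" once for every
# chain ID that shows up in more than one separate run of ATOM records.
find_repeated_chains() {
    awk '
        # Reset the bookkeeping at the start of every input file.
        FNR == 1 { split("", seen); split("", reported); current = "" }
        # Only ATOM records are considered.
        substr($0, 1, 4) != "ATOM" { next }
        {
            chain = substr($0, 22, 1)   # assumed fixed column for the chain ID
            if (chain != current) {     # a new run starts here
                if ((chain in seen) && !(chain in reported)) {
                    print FILENAME ": " chain
                    reported[chain] = 1
                }
                seen[chain] = 1
                current = chain
            }
        }
    ' "$@"
}
```

It could then be called as find_repeated_chains *.pdb; files without a repeated chain ID produce no output.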


Here you go, this awk script is tested and takes into account not just 'W':

{
    if (ln[$5] && ln[$5] + 1 != NR) {
        print "dup " $5 " at line " NR;
    }
    ln[$5] = NR;
}
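
The question mentions that a repeated chain ID is only a real problem when the first and 6th columns match as well. A stricter check along those lines could be sketched as below, assuming that the 6th column is the residue number in character columns 23-26 of an ATOM record (as in the sample) and using a made-up function name, find_repeated_residues. It is one possible reading of that remark, not part of the scripts above.

```shell
# Hypothetical helper: print "file: chain" once per chain whose
# (chain ID, residue number) pair occurs in two separate runs of ATOM records.
find_repeated_residues() {
    awk '
        # Reset the bookkeeping at the start of every input file.
        FNR == 1 { split("", seen); split("", reported); current = "" }
        # Column 1 must be ATOM for the record to count.
        substr($0, 1, 4) != "ATOM" { next }
        {
            chain  = substr($0, 22, 1)   # assumed chain ID column
            resseq = substr($0, 23, 4)   # assumed residue number columns
            gsub(/ /, "", resseq)        # drop the padding spaces
            key = chain SUBSEP resseq
            if (key != current) {        # a new residue run starts here
                if ((key in seen) && !(chain in reported)) {
                    print FILENAME ": " chain
                    reported[chain] = 1
                }
                seen[key] = 1
                current = key
            }
        }
    ' "$@"
}
```

With this version, a chain ID that comes back later with a different residue number is not reported; only a returning chain ID and residue number pair is.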