Check if string exists in non-consecutive lines in a given column

2023-03-30 19:58 Q&A author:

I have files with the following format:

ATOM   8962  CA  VAL W   8       8.647  81.467  25.656  1.00115.78           C  
ATOM   8963  C   VAL W   8      10.053  80.963  25.506  1.00114.60           C  
ATOM   8964  O   VAL W   8      10.636  80.422  26.442  1.00114.53           O  
ATOM   8965  CB  VAL W   8       7.643  80.389  25.325  1.00115.67           C  
ATOM   8966  CG1 VAL W   8       6.476  80.508  26.249  1.00115.54           C  
ATOM   8967  CG2 VAL W   8       7.174  80.526  23.886  1.00115.26           C  
ATOM   4440  O   TYR S  89       4.530 166.005 -14.543  1.00 95.76           O  
ATOM   4441  CB  TYR S  89       2.847 168.812 -13.864  1.00 96.31           C  
ATOM   4442  CG  TYR S  89       3.887 169.413 -14.756  1.00 98.43           C  
ATOM   4443  CD1 TYR S  89       3.515 170.073 -15.932  1.00100.05           C  
ATOM   4444  CD2 TYR S  89       5.251 169.308 -14.451  1.00100.50           C  
ATOM   4445  CE1 TYR S  89       4.464 170.642 -16.779  1.00100.70           C  
ATOM   4446  CE2 TYR S  89       6.219 169.868 -15.298  1.00101.40           C  
ATOM   4447  CZ  TYR S  89       5.811 170.535 -16.464  1.00100.46           C  
ATOM   4448  OH  TYR S  89       6.736 171.094 -17.321  1.00100.20           O  
ATOM   4449  N   LEU S  90       3.944 166.393 -12.414  1.00 94.95           N  
ATOM   4450  CA  LEU S  90       5.079 165.622 -11.914  1.00 94.44           C  
ATOM   5151  N   LEU W   8     -66.068 209.785 -11.037  1.00117.44           N  
ATOM   5152  CA  LEU W   8     -64.800 210.035 -10.384  1.00116.52           C  
ATOM   5153  C   LEU W   8     -64.177 208.641 -10.198  1.00116.71           C  
ATOM   5154  O   LEU W   8     -64.513 207.944  -9.241  1.00116.99           O  
ATOM   5155  CB  LEU W   8     -65.086 210.682  -9.033  1.00115.76           C  
ATOM   5156  CG  LEU W   8     -64.274 211.829  -8.478  1.00113.89           C  
ATOM   5157  CD1 LEU W   8     -64.528 211.857  -7.006  1.00111.94           C  
ATOM   5158  CD2 LEU W   8     -62.828 211.612  -8.739  1.00112.96           C

In principle, column 5 (W in this case, which represents the chain ID) should be identical only within consecutive chunks. However, in files with too many chains, there are not enough letters of the alphabet to assign a unique ID to each chain, and therefore duplicates may occur.

I would like to be able to check whether or not this is the case. In other words, I would like to know if a given chain ID (A-Z, always in the 5th column) is present in non-consecutive chunks. I do not mind if it changes from W to S; I would like to know if there are two chunks sharing the same chain ID, in this case, if W or S reappears at some point. In fact, this is only a problem if they also share the first and the 6th columns, but I do not want to complicate things too much.

I do not want to print the lines, just the name of the file in which the issue occurs and the chain ID (in this case W), in order to solve the problem. In fact, I already know how to solve the problem, but I need to identify the problematic files so I can focus on those and avoid repairing files that are already sound.

SOLUTION (thanks to all for your help, and especially to sehe):

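for pdb in *.pdb; do
    # Keep ATOM records, cut out the chain ID column, collapse runs, then count IDs with more than one run
    hit=$(awk '$1 == "ATOM"' "$pdb" | cut -c22-23 | uniq | sort | uniq -dc)
    # $hit is deliberately unquoted so the padding added by uniq -c collapses
    [ -n "$hit" ] && echo "$pdb =" $hit
done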

For this particular sample:

cut -c22-23 t | uniq | sort | uniq -dc

will output

2 W

(the 22nd column contains 2 runs of the letter 'W')
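To see why the pipeline above counts runs rather than lines, here is a minimal sketch on a made-up sequence of chain IDs (W W S S W). The function name count_repeated_runs is only an illustrative helper, not part of the original answer, and it assumes a uniq that accepts -d and -c together, as the command above does.

```shell
# count_repeated_runs: hypothetical helper; reads one chain ID per line on stdin
count_repeated_runs() {
    uniq |      # collapse each consecutive run to a single line: W S W
    sort |      # bring identical IDs next to each other: S W W
    uniq -dc    # keep only IDs that had more than one run, prefixed by the run count
}

# Made-up input: chain W appears in two separate runs, chain S in one
printf '%s\n' W W S S W | count_repeated_runs
```

Only W is reported, with a count of 2, because S forms a single run. The first uniq is what turns "number of lines" into "number of runs".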

Untested:

awk '
    seen[$5] && $5 != current {
        print "found non-consecutive chain on line " NR
        exit
    }
    { current = $5; seen[$5] = 1 }
' filename

Here you go. This awk script is tested and handles every chain ID, not just 'W':

{
    if (ln[$5] && ln[$5] + 1 != NR) {
        print "dup " $5 " at line " NR;
    }
    ln[$5] = NR;
}
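Since the question asks for the file name and the chain ID rather than line numbers, the same idea might be adapted as in the sketch below. It is an assumption-laden variant, not any answerer's code: the function name report_split_chains is hypothetical, the chain ID is read from fixed character position 22 (as the cut command in the question's solution does) instead of field $5, and only lines starting with ATOM are considered.

```shell
# report_split_chains: hypothetical helper; prints "file chain" once for every
# chain ID that reappears after a different chain ID within the same file.
report_split_chains() {
    awk '
        # Reset the per-file state at the first line of each input file
        FNR == 1 { split("", seen); split("", reported); current = "" }

        # Ignore everything that is not an ATOM record
        substr($0, 1, 4) != "ATOM" { next }

        {
            chain = substr($0, 22, 1)   # assumed fixed position of the chain ID
            # A chain seen before that is not the current run has been split
            if (chain != current && (chain in seen) && !(chain in reported)) {
                print FILENAME, chain
                reported[chain] = 1     # report each chain once per file
            }
            seen[chain] = 1
            current = chain
        }
    ' "$@"
}
```

It could then be called as report_split_chains *.pdb, printing one line per affected file and chain. Reading a fixed position rather than $5 is a design choice here: whitespace-separated fields shift if neighbouring columns ever run together.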
