best way to compare sequence of letters inside file?

2023-01-15 05:10 问答作者：

I have a file, that have lots of sequences of letters.

Some of these sequences might be equal, so I would like to compare them, all to all.

I'm doing something like this but this isn't exactly want I wanted:

for line in fl:
line = line.split()
for elem in line:
    if '>' in 开发者_JAVA百科elem:
        pass
    else:
        for el in line:
            if elem == el:
                print elem, el

example of the file:

>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA  
>2
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA    
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA  
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA  
>5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA  
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG  
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA

So what I want if to known if any sequence is totally equal to 1, or to 2, and so on.

If the goal is to simply group like sequences together, then simply sorting the data will do the trick. Here is a solution that uses BioPython to parse the input FASTA file, sorts the collection of sequences, uses the standard Python itertools.groupby function to merge ids for equal sequences, and outputs a new FASTA file:

from itertools import groupby
from Bio       import SeqIO

records = list(SeqIO.parse(file('spoo.fa'),'fasta'))

def seq_getter(s): return str(s.seq)
records.sort(key=seq_getter)

for seq,equal in groupby(records, seq_getter):
  ids = ','.join(s.id for s in equal)
  print '>%s' % ids
  print seq

Output:

>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>2,5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA

In general for this type of work you may want to investigate Biopython which has lots of functionality for parsing and otherwise dealing with sequences.

However, your particular problem can be solved using a dict, an example of which Manoj has given you.

Comparing long sequences of letters is going to be pretty inefficient. It will be quicker to compare the hash of the sequences. Python offers two built in data types that use hash: set and dict. It's best to use dict here as we can store the line numbers of all the matches.

I've assumed the file has identifiers and labels on alternate lines, so if we split the file text on new lines we can take one line as the id and the next as the sequence to match.

We then use a dict with the sequence as a key. The corresponding value is a list of ids which have this sequence. By using defaultdict from collections we can easily handle the case of a sequence not being in the dict; if the key hasn't be used before defaultdict will automatically create a value for us, in this case an empty list.

So when we've finished working through the file the values of the dict will effectively be a list of lists, each entry containing the ids which share a sequence. We can then use a list comprehension to pull out the interesting values, i.e. entries where more than one id is used by a sequence.

from collections import defaultdict
lines = filetext.split("\n")
sequences = defaultdict(list)

while (lines):
    id = lines.pop(0)
    data = lines.pop(0)
    sequences[data].append(id)

results = [match for match in sequences.values() if len(match) > 1]
print results

The following script will return a count of sequences. It returns a dictionary with the individual, distinct sequences as keys and the numbers (the first part of each line) where these sequences occur.

#!/usr/bin/python
import sys
from collections import defaultdict

def count_sequences(filename):
    result = defaultdict(list)
    with open(filename) as f:
        for index, line in enumerate(f):        
            sequence = line.replace('\n', '')
            line_number = index + 1
            result[sequence].append(line_number)
    return result

if __name__ == '__main__':
    filename = sys.argv[1]
    for sequence, occurrences in count_sequences(filename).iteritems():
        print "%s: %s, found in %s" % (sequence, len(occurrences), occurrences)

Sample output:

etc@etc:~$ python ./fasta.py /path/to/my/file
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA: 1, found in ['4']
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA: 1, found in ['3']
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA: 2, found in ['2', '5']
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA: 1, found in ['7']
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA: 1, found in ['1']
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG: 1, found in ['6']

Update

Changed code to use dafaultdict and for loop. Thanks @KennyTM.

Update 2

Changed code to use append rather than +. Thanks @Dave Webb.

继续阅读：python

best way to compare sequence of letters inside file?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？