开发者

best way to compare sequence of letters inside file?

I have a file, that have lots of sequences of letters.

Some of these sequences might be equal, so I would like to compare them, all to all.

I'm doing something like this but this isn't exactly want I wanted:

for line in fl:
line = line.split()
for elem in line:
    if '>' in 开发者_JAVA百科elem:
        pass
    else:
        for el in line:
            if elem == el:
                print elem, el

example of the file:

>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA  
>2
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA    
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA  
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA  
>5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA  
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG  
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA

So what I want if to known if any sequence is totally equal to 1, or to 2, and so on.


If the goal is to simply group like sequences together, then simply sorting the data will do the trick. Here is a solution that uses BioPython to parse the input FASTA file, sorts the collection of sequences, uses the standard Python itertools.groupby function to merge ids for equal sequences, and outputs a new FASTA file:

from itertools import groupby
from Bio       import SeqIO

records = list(SeqIO.parse(file('spoo.fa'),'fasta'))

def seq_getter(s): return str(s.seq)
records.sort(key=seq_getter)

for seq,equal in groupby(records, seq_getter):
  ids = ','.join(s.id for s in equal)
  print '>%s' % ids
  print seq

Output:

>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>2,5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA


In general for this type of work you may want to investigate Biopython which has lots of functionality for parsing and otherwise dealing with sequences.

However, your particular problem can be solved using a dict, an example of which Manoj has given you.


Comparing long sequences of letters is going to be pretty inefficient. It will be quicker to compare the hash of the sequences. Python offers two built in data types that use hash: set and dict. It's best to use dict here as we can store the line numbers of all the matches.

I've assumed the file has identifiers and labels on alternate lines, so if we split the file text on new lines we can take one line as the id and the next as the sequence to match.

We then use a dict with the sequence as a key. The corresponding value is a list of ids which have this sequence. By using defaultdict from collections we can easily handle the case of a sequence not being in the dict; if the key hasn't be used before defaultdict will automatically create a value for us, in this case an empty list.

So when we've finished working through the file the values of the dict will effectively be a list of lists, each entry containing the ids which share a sequence. We can then use a list comprehension to pull out the interesting values, i.e. entries where more than one id is used by a sequence.

from collections import defaultdict
lines = filetext.split("\n")
sequences = defaultdict(list)

while (lines):
    id = lines.pop(0)
    data = lines.pop(0)
    sequences[data].append(id)

results = [match for match in sequences.values() if len(match) > 1]
print results


The following script will return a count of sequences. It returns a dictionary with the individual, distinct sequences as keys and the numbers (the first part of each line) where these sequences occur.

#!/usr/bin/python
import sys
from collections import defaultdict

def count_sequences(filename):
    result = defaultdict(list)
    with open(filename) as f:
        for index, line in enumerate(f):        
            sequence = line.replace('\n', '')
            line_number = index + 1
            result[sequence].append(line_number)
    return result

if __name__ == '__main__':
    filename = sys.argv[1]
    for sequence, occurrences in count_sequences(filename).iteritems():
        print "%s: %s, found in %s" % (sequence, len(occurrences), occurrences)

Sample output:

etc@etc:~$ python ./fasta.py /path/to/my/file
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA: 1, found in ['4']
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA: 1, found in ['3']
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA: 2, found in ['2', '5']
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA: 1, found in ['7']
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA: 1, found in ['1']
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG: 1, found in ['6']

Update

Changed code to use dafaultdict and for loop. Thanks @KennyTM.

Update 2

Changed code to use append rather than +. Thanks @Dave Webb.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜