best way to compare sequence of letters inside file?
I have a file, that have lots of sequences of letters.
Some of these sequences might be equal, so I would like to compare them, all to all. I'm doing something like this but this isn't exactly want I wanted:for line in fl:
line = line.split()
for elem in line:
if '>' in 开发者_JAVA百科elem:
pass
else:
for el in line:
if elem == el:
print elem, el
example of the file:
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA
>2
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
So what I want if to known if any sequence is totally equal to 1, or to 2, and so on.
If the goal is to simply group like sequences together, then simply sorting the data will do the trick. Here is a solution that uses BioPython to parse the input FASTA file, sorts the collection of sequences, uses the standard Python itertools.groupby function to merge ids for equal sequences, and outputs a new FASTA file:
from itertools import groupby
from Bio import SeqIO
records = list(SeqIO.parse(file('spoo.fa'),'fasta'))
def seq_getter(s): return str(s.seq)
records.sort(key=seq_getter)
for seq,equal in groupby(records, seq_getter):
ids = ','.join(s.id for s in equal)
print '>%s' % ids
print seq
Output:
>3
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA
>4
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA
>2,5
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA
>7
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA
>6
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG
>1
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA
In general for this type of work you may want to investigate Biopython which has lots of functionality for parsing and otherwise dealing with sequences.
However, your particular problem can be solved using a dict, an example of which Manoj has given you.
Comparing long sequences of letters is going to be pretty inefficient. It will be quicker to compare the hash of the sequences. Python offers two built in data types that use hash: set
and dict
. It's best to use dict
here as we can store the line numbers of all the matches.
I've assumed the file has identifiers and labels on alternate lines, so if we split the file text on new lines we can take one line as the id and the next as the sequence to match.
We then use a dict
with the sequence as a key. The corresponding value is a list of ids which have this sequence. By using defaultdict
from collections
we can easily handle the case of a sequence not being in the dict
; if the key hasn't be used before defaultdict
will automatically create a value for us, in this case an empty list
.
So when we've finished working through the file the values of the dict
will effectively be a list
of list
s, each entry containing the ids which share a sequence. We can then use a list comprehension to pull out the interesting values, i.e. entries where more than one id is used by a sequence.
from collections import defaultdict
lines = filetext.split("\n")
sequences = defaultdict(list)
while (lines):
id = lines.pop(0)
data = lines.pop(0)
sequences[data].append(id)
results = [match for match in sequences.values() if len(match) > 1]
print results
The following script will return a count of sequences. It returns a dictionary with the individual, distinct sequences as keys and the numbers (the first part of each line) where these sequences occur.
#!/usr/bin/python
import sys
from collections import defaultdict
def count_sequences(filename):
result = defaultdict(list)
with open(filename) as f:
for index, line in enumerate(f):
sequence = line.replace('\n', '')
line_number = index + 1
result[sequence].append(line_number)
return result
if __name__ == '__main__':
filename = sys.argv[1]
for sequence, occurrences in count_sequences(filename).iteritems():
print "%s: %s, found in %s" % (sequence, len(occurrences), occurrences)
Sample output:
etc@etc:~$ python ./fasta.py /path/to/my/file
GTCGTCGAAAGAGGCTT-GCCCGCTACGCGCCCCCTGATA: 1, found in ['4']
GTCGTCGAAAGAGGCTT-GCCCGCCACGCGCCCGCTGATA: 1, found in ['3']
GTCGTCGAAAGAGGTCT-GACCGCTTCGCGCCCGCTGGTA: 2, found in ['2', '5']
GTCGTCGAAAGAGGTCT-GACCGCTTCTCGCCCGCTGATA: 1, found in ['7']
GTCGTCGAAGCATGCCGGGCCCGCTTCGTGTTCGCTGATA: 1, found in ['1']
GTCGTCGAAAGAGTCTGACCGCTTCTCGCCCGCTGATACG: 1, found in ['6']
Update
Changed code to use dafaultdict
and for
loop. Thanks @KennyTM.
Update 2
Changed code to use append
rather than +
. Thanks @Dave Webb.
精彩评论