Defining a function that counts relative frequency of Amino Acids
I'm trying to calculate codon frequency within a given sequence of DNA.
For example:
sequence = 'ATGAAGAAA'
codons = ['ATG', 'AAG', 'AAA']
f开发者_高级运维or XX in codons:
frequency = codons.count(XX)/(codons.count(XX)+codons.count(XX2)+codons.count(XX3))
Note that XX2 and XX3 will not always be in the sequence. Some codons may or may not have multiple codons.
Example: Lysine has 2 codons, AAA and AAG
so the frequency of
AAA = codons.count('AAA')/(codons.count('AAA') + codons.count('AAG'))
How can I do this for EVERY codon in the list? How do I account for multiple codons?
use defaultdict
from collections import defaultdict
mydict = defaultdict(int)
for aa in mysecuence:
mydict[aa] +=1
This works for aminoacids (proteins).
For codons, you should iterate on the sequence in 3 positions steps to get the keys of the defaultdict. For example:
>>> mysec = "GAUCACTUGCCA"
>>> a = [mysec[i:i+3] for i in range(0,len(mysec), 3)]
>>> print a
['GAU', 'CAC', 'TUG', 'CCA']
EDIT: If you want to calculate degeneration, you should prepare a dictionary relating each codon (key) with its degenerated codons (value, list of codons). To calculate the frecuency, from the defaultdict you can get the counts for each codon, then for each codon you calculate the sum of the counts of the degenerated codons read from the dictionary of codons indicated above. Then you can calculate the frecuency.
EDIT 2: Here you have a real example:
from collections import defaultdict
#the first 600 nucleotides from GenBank: AAHX01097212.1
rna = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
"cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
"ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
"ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
"gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
"aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")
seq = rna.upper().replace('T', 'U')
#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
degenerated = (('GCU', 'GCC', 'GCA', 'GCG'),
('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'),
('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
('AAA', 'AAG'), ('AAU', 'AAC'), ('GAU', 'GAC'),
('UUU', 'UUC'), ('UGU', 'UGC'), ('CCU', 'CCC', 'CCA', 'CCG'),
('CAA', 'CAG'), ('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'),
('GAA', 'GAG'), ('ACU', 'ACC', 'ACA', 'ACG'),
('GGU', 'GGC', 'GGA', 'GGG'), ('CAU', 'CAC'), ('UAU', 'UAC'),
('AUU', 'AUC', 'AUA'), ('GUU', 'GUC', 'GUA', 'GUG'),
('UAA', 'UGA', 'UAG'))
#prepare the dictio of degenerated codons
degen_dict = {}
for codons in degenerated:
for codon in codons:
degen_dict[codon] = codons
#query_codons
max_seq = len(seq)
query_codons = [seq[i:i+3] for i in range(0, max_seq, 3)]
#prepare dictio of counts:
counts = defaultdict(int)
for codon in query_codons:
counts[codon] +=1
#actual calculation of frecuencies
data = {}
for codon in query_codons:
if codon in degen_dict:
totals = sum(counts[deg] for deg in degen_dict[codon])
frecuency = float(counts[codon]) / totals
else:
frecuency = 1.00
data[codon] = frecuency
#print results
for codon, frecuency in data.iteritems():
print "%s -> %.2f" %(codon, frecuency)
#produces:
GUC -> 0.57
AUA -> 1.00
ACG -> 0.50
AAC -> 1.00
CCU -> 0.25
UAU -> 1.00
..........
GCU -> 0.19
GAU -> 1.00
UAG -> 0.33
CUC -> 0.38
UUA -> 0.13
UGA -> 0.33
If your sequence is in the correct reading frame:
>>> import collections
>>>
>>> code = { 'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
... 'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
... 'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
... 'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',
... 'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
... 'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
... 'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
... 'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',
... 'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
... 'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
... 'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
... 'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',
... 'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
... 'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
... 'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
... 'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'
... }
>>> def count_codons(cds):
... counts = collections.defaultdict(int)
... for i in range(0,len(cds),3):
... codon = cds[i:i+3]
... counts[codon] += 1
... return counts
...
>>> def translate(cds, code):
... return "".join((code[cds[i:i+3]] for i in range(0, len(cds), 3)))
...
>>> seq = 'ATGAAGAAA'
>>>
>>> codon_counts = count_codons(seq.lower())
>>> trans_seq = translate(seq.lower(), code)
>>>
>>> [(codon, code[codon], float(codon_counts[codon])/trans_seq.count(code[codon])) for codon in codon_counts.keys()]
[('atg', 'M', 1.0), ('aag', 'K', 0.5), ('aaa', 'K', 0.5)]
>>>
other info:
I think you are asking to find something called codon usage.
There are tools online which allow you to find codon usage. This one also allows for offline use.
http://www.bioinformatics.org/sms2/codon_usage.html
and results (in this 'Fraction' is what you are asking for):
Results for 9 residue sequence "sample sequence one" starting "ATGAAGAAA"
AmAcid Codon Number /1000 Fraction ..
Ala GCG 0.00 0.00 0.00
Ala GCA 0.00 0.00 0.00
Ala GCT 0.00 0.00 0.00
Ala GCC 0.00 0.00 0.00
Cys TGT 0.00 0.00 0.00
Cys TGC 0.00 0.00 0.00
Asp GAT 0.00 0.00 0.00
Asp GAC 0.00 0.00 0.00
Glu GAG 0.00 0.00 0.00
Glu GAA 0.00 0.00 0.00
Phe TTT 0.00 0.00 0.00
Phe TTC 0.00 0.00 0.00
Gly GGG 0.00 0.00 0.00
Gly GGA 0.00 0.00 0.00
Gly GGT 0.00 0.00 0.00
Gly GGC 0.00 0.00 0.00
His CAT 0.00 0.00 0.00
His CAC 0.00 0.00 0.00
Ile ATA 0.00 0.00 0.00
Ile ATT 0.00 0.00 0.00
Ile ATC 0.00 0.00 0.00
Lys AAG 1.00 333.33 0.50
Lys AAA 1.00 333.33 0.50
Leu TTG 0.00 0.00 0.00
Leu TTA 0.00 0.00 0.00
Leu CTG 0.00 0.00 0.00
Leu CTA 0.00 0.00 0.00
Leu CTT 0.00 0.00 0.00
Leu CTC 0.00 0.00 0.00
Met ATG 1.00 333.33 1.00
Asn AAT 0.00 0.00 0.00
Asn AAC 0.00 0.00 0.00
Pro CCG 0.00 0.00 0.00
Pro CCA 0.00 0.00 0.00
Pro CCT 0.00 0.00 0.00
Pro CCC 0.00 0.00 0.00
Gln CAG 0.00 0.00 0.00
Gln CAA 0.00 0.00 0.00
Arg AGG 0.00 0.00 0.00
Arg AGA 0.00 0.00 0.00
Arg CGG 0.00 0.00 0.00
Arg CGA 0.00 0.00 0.00
Arg CGT 0.00 0.00 0.00
Arg CGC 0.00 0.00 0.00
Ser AGT 0.00 0.00 0.00
Ser AGC 0.00 0.00 0.00
Ser TCG 0.00 0.00 0.00
Ser TCA 0.00 0.00 0.00
Ser TCT 0.00 0.00 0.00
Ser TCC 0.00 0.00 0.00
Thr ACG 0.00 0.00 0.00
Thr ACA 0.00 0.00 0.00
Thr ACT 0.00 0.00 0.00
Thr ACC 0.00 0.00 0.00
Val GTG 0.00 0.00 0.00
Val GTA 0.00 0.00 0.00
Val GTT 0.00 0.00 0.00
Val GTC 0.00 0.00 0.00
Trp TGG 0.00 0.00 0.00
Tyr TAT 0.00 0.00 0.00
Tyr TAC 0.00 0.00 0.00
End TGA 0.00 0.00 0.00
End TAG 0.00 0.00 0.00
End TAA 0.00 0.00 0.00
cusp is the codon usage tool from EMBOSS which also may be worth taking a look at.
You may want to checkout BioPython for working with biological sequences. I believe they have a codon usage module.
PLY is a parser module that has some nice debugging features; it is very good at tasks like this...
from ply import lex
tokens = (
'CODON',
)
t_CODON = (
r"ATG|"
r"AAG|"
r"AAF|"
r"AAC|"
r"BOB|"
r"FOO|"
r"BAR|"
r"AAA"
)
def t_error(t):
raise TypeError("Unknown codon '%s'" % (t.value,))
lex.lex()
sequence = "AAABOBAACAAAFOOAACBARAAAAAA"
ccount = dict()
total = 0.0
lex.input(sequence)
for tok in iter(lex.token, None):
if ccount.get(tok.value, False):
ccount[tok.value] += 1
else:
ccount[tok.value] = 1
total += 1.0
for codon,count in ccount.items():
print "Frequency of %s is %f" % (codon, count/total)
Running that code produces...
[mpenning@Bucksnort ~]$ python codon.py
Frequency of BAR is 0.111111
Frequency of BOB is 0.111111
Frequency of FOO is 0.111111
Frequency of AAA is 0.444444
Frequency of AAC is 0.222222
I'm a little lost when you start introducing the chemical terminology, but you can probably take over from here...
a codon table containing ALL the 64 codons, even the non-degenarated ones (they constitute one element groups)
counting the occurences of each codon's group at the same time that occurences of codons are counted during the iteration
codon table comprising the names of coded amino acids -> a good display
code:
from collections import defaultdict
# the first 600 nucleotides from GenBank: AAHX01097212.1
adn = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
"cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
"ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
"ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
"gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
"aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")
arn = adn.upper().replace('T','U')
#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
codon_table = ((('GCU', 'GCC', 'GCA', 'GCG'), 'Alanine'),
(('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'), 'Leucine'),
(('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'), 'Arginine'),
(('AAA', 'AAG'), 'Lysine'),
(('AAU', 'AAC'), 'Asparagine'),
(('AUG',), 'Methionine'),
(('GAU', 'GAC'), 'Aspartic acid' ),
(('UUU', 'UUC'), 'Phenylalanine'),
(('UGU', 'UGC'), 'Cysteine'),
(('CCU', 'CCC', 'CCA', 'CCG'), 'Proline') ,
(('CAA', 'CAG'), 'Glutamine'),
(('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'), 'Serine'),
(('GAA', 'GAG'), 'Glutamic acid'),
(('ACU', 'ACC', 'ACA', 'ACG'), 'Threonine'),
(('GGU', 'GGC', 'GGA', 'GGG'), 'Glycine'),
(('UGG',), 'Tryptophane'),
(('CAU', 'CAC'), 'Histidine'),
(('UAU', 'UAC'), 'Tyrosine'),
(('AUU', 'AUC', 'AUA'), 'Isoleucine'),
(('GUU', 'GUC', 'GUA', 'GUG'), 'Valine'),
(('UAA', 'UGA', 'UAG'), 'STOP') )
siblings = dict( (cod, codgroup) for codgroup,aa in codon_table for cod in codgroup )
cod_count, grp_count, freq = defaultdict(int), defaultdict(int), {}
for cod in (arn[i:i+3] for i in xrange(0,len(arn),3)):
cod_count[cod] += 1
grp_count[siblings[cod]] += 1
for cod in siblings.iterkeys(): # the keys of siblings are the 64 codons
if siblings[cod] in grp_count:
freq[cod] = float(cod_count[cod])/grp_count[siblings[cod]]
else:
freq[cod] = '-* Missing *-'
display = '\n'.join(aa.rjust(13)+\
'\n'.join('%s %-16s' % (cod.rjust(18 if i else 5),freq[cod])
for i,cod in enumerate(codgrp))
for codgrp,aa in codon_table)
# editing addition:
def outputResults(filename,arn,codon_table,displ):
li = ['This file is named %s' % filename]
li.append('The sequence of ARN:\n%s' %\
'\n'.join(arn[i:i+42] for i in xrange(0,len(arn),42)))
li.append('Size of the sequence : '+str(len(arn)))
li.append('Codon_table:\n'+\
'\n'.join('%s : %s' % (u,v) for u,v in codon_table))
li.append('Frequency results :\n'+displ)
with open(filename,'w') as f:
f.writelines('\n\n'.join(li))
outputResults('ARN_mem.txt',arn,codon_table,display)
print display
.
EDIT
I've added a function outputResults() to show the manner to record data and results in a file
The resulting file's content is:
This file is named ARN_mem.txt
The sequence of ARN:
UCCCCCGCAGCUUCGGGAACGUGCGGGCUCGGGAGGGAGGGG
CCUGGCGCCGGGCGCGCGCCUGCGCCCCACCCCGCCCCACCC
UGGCGGGUCUCGCGCGCCCGGCCCGCCUCCUGUCAACCCCAG
CGCGGCGGUCAGGUGGUCCCCAGCCCUUGGCCCCAGCCUCCA
GCUUCCUGGUCCCUCGGGCUCUGAGUCCUGUCUCCGGCAGAU
CGCCUUUCUGAUUGUUCUCCUGCGCAGCUGGAGGUGUAUAGC
CCCUAGCCGAGCUAUGGUGCCUCAGCAGAUGUGAGGAGGUAG
UGGGUCAGGAUAAACCCGCGCACUCCAUAAUAACGUGCCAGG
GCUCAGUGACUUGGGUCUGCAUUA
Size of the sequence : 360
Codon_table:
('GCU', 'GCC', 'GCA', 'GCG') : Alanine
('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG') : Leucine
('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG') : Arginine
('AAA', 'AAG') : Lysine
('AAU', 'AAC') : Asparagine
('AUG',) : Methionine
('GAU', 'GAC') : Aspartic acid
('UUU', 'UUC') : Phenylalanine
('UGU', 'UGC') : Cysteine
('CCU', 'CCC', 'CCA', 'CCG') : Proline
('CAA', 'CAG') : Glutamine
('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC') : Serine
('GAA', 'GAG') : Glutamic acid
('ACU', 'ACC', 'ACA', 'ACG') : Threonine
('GGU', 'GGC', 'GGA', 'GGG') : Glycine
('UGG',) : Tryptophane
('CAU', 'CAC') : Histidine
('UAU', 'UAC') : Tyrosine
('AUU', 'AUC', 'AUA') : Isoleucine
('GUU', 'GUC', 'GUA', 'GUG') : Valine
('UAA', 'UGA', 'UAG') : STOP
Frequency results :
Alanine GCU 0.1875
GCC 0.375
GCA 0.25
GCG 0.1875
Leucine UUA 0.125
UUG 0.0
CUU 0.25
CUC 0.375
etc.............
I'm not sure if I've fully understood the question, but I think you need to split the calculations into two stages: first count how many times each codon occurs, and then work out the frequencies. I've come up with the following code:
from collections import defaultdict
# Initial sequence.
sequence = "AAABOBAACAAAFOOAACBARAAAAAA"
# Which codons are grouped together.
groups = (
('AAA', 'AAC'),
('BOB',),
('FOO', 'BAR', 'BAA'),
)
# Separate into list of codons.
codonList = []
for codons in range(0, len(sequence), 3):
codonList.append(sequence[codons:codons+3])
# Count how many times each codon is used.
counts = defaultdict(int)
for codon in codonList:
counts[codon] += 1
# Go through and calculate frequencies of each codon.
freqs = {}
for group in groups:
total = float(sum(counts[codon] for codon in group))
for codon in group:
freqs[codon] = counts[codon] / total
# Done.
print freqs
Note the explicit conversion of total
to a floating-point number in the last loop. If it is left as an integer, the subsequent division will be either 0 or 1 on Python 2.x, so we need to convert it to get a floating-point output. The output I get is:
blair@blair-eeepc:~$ python codons.py
{'BAR': 0.5, 'AAC': 0.33333333333333331, 'BAA': 0.0, 'AAA': 0.66666666666666663, 'BOB': 1.0, 'FOO': 0.5}
Is this the sort of output you were looking for?
精彩评论