Binning sequence reads by GC content [closed]
I would like to "bin" (split into separate files) a multi-fasta nucleotide sequence file (e.g. a Roche-454 run of ~500,000 reads average read length 250bp). I would like the bins based on GC content of each read. The resultant output would be 8 multi-fasta files:
<20% GC content
21-30% GC content
31-40% GC content
41-50% GC content
51-60% GC content
61-70% GC content
71-80% GC content
>80 % GC content
Does anyone know of a script or program that does this already? If not can so开发者_如何学Gomeone suggest how to sort the multi-fasta file based on GC content (that I can then split it down into the relevant bins)?
In R / Bioconductor, the tasks would be (a) load the appropriate library (b) read the fasta file (c) calculate nucleotide use and gc % (d) cut the data into bins and (e) output the original data into separate files. Along the lines of
## load
library(ShortRead)
## input
fa = readFasta("/path/to/fasta.fasta")
## gc content. 'abc' is a matrix, [, c("G", "C")] selects two columns
abc = alphabetFrequency(sread(fa), baseOnly=TRUE)
gc = rowSums(abc[,c("G", "C")]) / rowSums(abc)
## cut gc content into bins, with breaks at seq(0, 1, .2)
bin = cut(gc, seq(0, 1, .2))
## output, [bin==lvl] selects the reads whose 'bin' value is lvl
for (lvl in levels(bin)) {
fileName = sprintf("%s.fasta", lvl)
writeFasta(fa[bin==lvl], file=fileName)
}
To get going with R / Bioconductor see http://bioconductor.org/install. The memory requirements for 454 data of the size indicated are not too bad, and the script here will be fairly speedy (e.g., 7s for 260k reads).
I suggest using Python and Biopython or Perl and Bioperl to read in the FASTA-files. There is a script that calculates C-content of sequences in Bioperl here, and Biopython has a function for it. Then simply store the GC content for each sequence in a dictionary or hash and go through each, writing them into a file dependin on how high the GC-content is.
Do you need any more specific help?
If I understand the problem correctly, you need something like the following (Python):
def GC(seq): # determine the GC content
s = seq.upper()
return 100.0 * (s.count('G') + s.count('C')) / len(s)
def bin(gc): # get the number of the 'bin' for this value of GC content
if gc < 20: return 1
elif gc > 80: return 8
else:
return int(gc/10)
Then you just need to read the entries from the file, calculate the GC content, find the right bin and write the entry to the appropriate file. The following example implements this with a Python package we use in the lab:
from pyteomics import fasta
def split_to_bin_files(multifile):
"""Reads a file and writes the entries to appropriate 'bin' files.
`multifile` must be a filename (str)"""
for entry in fasta.read(multifile):
fasta.write((entry,), (multifile+'_bin_'+
str(bin(GC(entry[1])))))
Then you just call it like split_to_bin_files('mybig.fasta')
.
精彩评论