开发者

Binning sequence reads by GC content [closed]

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center. Closed 10 years ago.

I would like to "bin" (split into separate files) a multi-fasta nucleotide sequence file (e.g. a Roche-454 run of ~500,000 reads average read length 250bp). I would like the bins based on GC content of each read. The resultant output would be 8 multi-fasta files:

<20% GC content

21-30% GC content

31-40% GC content

41-50% GC content

51-60% GC content

61-70% GC content

71-80% GC content

>80 % GC content

Does anyone know of a script or program that does this already? If not can so开发者_如何学Gomeone suggest how to sort the multi-fasta file based on GC content (that I can then split it down into the relevant bins)?


In R / Bioconductor, the tasks would be (a) load the appropriate library (b) read the fasta file (c) calculate nucleotide use and gc % (d) cut the data into bins and (e) output the original data into separate files. Along the lines of

## load
library(ShortRead)
## input
fa = readFasta("/path/to/fasta.fasta")
## gc content. 'abc' is a matrix, [, c("G", "C")] selects two columns
abc = alphabetFrequency(sread(fa), baseOnly=TRUE)
gc = rowSums(abc[,c("G", "C")]) / rowSums(abc)
## cut gc content into bins, with breaks at seq(0, 1, .2)
bin = cut(gc, seq(0, 1, .2))
## output, [bin==lvl] selects the reads whose 'bin' value is lvl
for (lvl in levels(bin)) {
    fileName = sprintf("%s.fasta", lvl)
    writeFasta(fa[bin==lvl], file=fileName)
}

To get going with R / Bioconductor see http://bioconductor.org/install. The memory requirements for 454 data of the size indicated are not too bad, and the script here will be fairly speedy (e.g., 7s for 260k reads).


I suggest using Python and Biopython or Perl and Bioperl to read in the FASTA-files. There is a script that calculates C-content of sequences in Bioperl here, and Biopython has a function for it. Then simply store the GC content for each sequence in a dictionary or hash and go through each, writing them into a file dependin on how high the GC-content is.

Do you need any more specific help?


If I understand the problem correctly, you need something like the following (Python):

def GC(seq): # determine the GC content
    s = seq.upper()
    return 100.0 * (s.count('G') + s.count('C')) / len(s)

def bin(gc): # get the number of the 'bin' for this value of GC content
    if gc < 20: return 1
    elif gc > 80: return 8
    else:
        return int(gc/10)

Then you just need to read the entries from the file, calculate the GC content, find the right bin and write the entry to the appropriate file. The following example implements this with a Python package we use in the lab:

from pyteomics import fasta

def split_to_bin_files(multifile):
"""Reads a file and writes the entries to appropriate 'bin' files.
`multifile` must be a filename (str)"""

    for entry in fasta.read(multifile):
        fasta.write((entry,), (multifile+'_bin_'+
                    str(bin(GC(entry[1])))))

Then you just call it like split_to_bin_files('mybig.fasta').

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜