Binning sequence reads by GC content [closed]

2023-01-31 05:09 问答作者：

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center. Closed 10 years ago.

I would like to "bin" (split into separate files) a multi-fasta nucleotide sequence file (e.g. a Roche-454 run of ~500,000 reads average read length 250bp). I would like the bins based on GC content of each read. The resultant output would be 8 multi-fasta files:

<20% GC content

21-30% GC content

31-40% GC content

41-50% GC content

51-60% GC content

61-70% GC content

71-80% GC content

>80 % GC content

Does anyone know of a script or program that does this already? If not can so开发者_如何学Gomeone suggest how to sort the multi-fasta file based on GC content (that I can then split it down into the relevant bins)?

In R / Bioconductor, the tasks would be (a) load the appropriate library (b) read the fasta file (c) calculate nucleotide use and gc % (d) cut the data into bins and (e) output the original data into separate files. Along the lines of

## load
library(ShortRead)
## input
fa = readFasta("/path/to/fasta.fasta")
## gc content. 'abc' is a matrix, [, c("G", "C")] selects two columns
abc = alphabetFrequency(sread(fa), baseOnly=TRUE)
gc = rowSums(abc[,c("G", "C")]) / rowSums(abc)
## cut gc content into bins, with breaks at seq(0, 1, .2)
bin = cut(gc, seq(0, 1, .2))
## output, [bin==lvl] selects the reads whose 'bin' value is lvl
for (lvl in levels(bin)) {
    fileName = sprintf("%s.fasta", lvl)
    writeFasta(fa[bin==lvl], file=fileName)
}

To get going with R / Bioconductor see http://bioconductor.org/install. The memory requirements for 454 data of the size indicated are not too bad, and the script here will be fairly speedy (e.g., 7s for 260k reads).

I suggest using Python and Biopython or Perl and Bioperl to read in the FASTA-files. There is a script that calculates C-content of sequences in Bioperl here, and Biopython has a function for it. Then simply store the GC content for each sequence in a dictionary or hash and go through each, writing them into a file dependin on how high the GC-content is.

Do you need any more specific help?

If I understand the problem correctly, you need something like the following (Python):

def GC(seq): # determine the GC content
    s = seq.upper()
    return 100.0 * (s.count('G') + s.count('C')) / len(s)

def bin(gc): # get the number of the 'bin' for this value of GC content
    if gc < 20: return 1
    elif gc > 80: return 8
    else:
        return int(gc/10)

Then you just need to read the entries from the file, calculate the GC content, find the right bin and write the entry to the appropriate file. The following example implements this with a Python package we use in the lab:

from pyteomics import fasta

def split_to_bin_files(multifile):
"""Reads a file and writes the entries to appropriate 'bin' files.
`multifile` must be a filename (str)"""

    for entry in fasta.read(multifile):
        fasta.write((entry,), (multifile+'_bin_'+
                    str(bin(GC(entry[1])))))

Then you just call it like split_to_bin_files('mybig.fasta').

继续阅读：bioinformatics fasta sequences

Binning sequence reads by GC content [closed]

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？