
Algorithm for variability analysis

I work with a lot of histograms. In particular, these histograms are of basecalls along segments on the human genome.

Each point along the x-axis is one of the four nitrogenous bases (A, C, T, G) that compose DNA, and the y-axis represents how many times that base was "called" (that is, recognized by a sequencing machine; sequencing the genome simply means determining the identity of each base along it).

Many of these histograms display roughly linear dropoffs (when the machines aren't able to get sufficient read depth) that fall from plateau-like regions to 0 or almost 0. When the count drops to zero, it means the sequencer isn't able to determine the identity of the base. If you've seen the double helix before, it means the sequencer can't figure out the identity of one half of a rung of the helix. Certain regions of the genome are more difficult to characterize than others. Bases (x data points) with high numbers of basecalls, on the order of >=100, can be definitively identified. For example, if there were a total of 250 calls for one base, and we had 248 T's called, 1 G called, and 1 A called, we would call that a T. Regions with 0 basecalls are of concern because we then have to infer the identity of the low-read region from neighboring regions. Is there a straightforward algorithm for assigning these plots a score that reflects this tendency to drop off to zero? See box.net/shared/nbygq2x03u for an example histogram.


You could just use the count of base positions where the read depth is 0. The slope of the dropoff could also be a useful indicator (a steep negative slope means a drop from a plateau).
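As a rough sketch of both ideas, assuming the histogram is available as a plain list of per-position read depths: the function name `depth_dropout_score`, the `low` threshold, and the `window` size are all made up for illustration, and the slope here is a simple finite difference over a sliding window rather than a fitted line.

```python
def depth_dropout_score(depth, low=0, window=5):
    """Summarize dropouts in a list of per-position read depths.

    `low` and `window` are illustrative defaults, not tuned values.
    """
    n = len(depth)

    # Idea 1: count positions whose read depth is at or below `low`.
    zero_count = sum(1 for d in depth if d <= low)

    # Idea 2: most negative average slope (depth change per base)
    # between any two positions `window` bases apart.
    steepest_drop = 0.0
    for i in range(n - window):
        slope = (depth[i + window] - depth[i]) / window
        steepest_drop = min(steepest_drop, slope)

    return {
        "zero_count": zero_count,
        "zero_fraction": zero_count / n if n else 0.0,
        "steepest_drop": steepest_drop,
    }


if __name__ == "__main__":
    # Made-up example: a plateau, a linear dropoff, then no coverage.
    example = [100] * 10 + [80, 60, 40, 20, 0] + [0] * 5
    print(depth_dropout_score(example))
```

The zero fraction makes segments of different lengths comparable, and the steepest drop separates a sharp fall from a plateau from coverage that is just uniformly low. How to weight the two into a single score depends on what you want to flag.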
