How can I find low regions in a graph using Perl/R?

I'm examining some biological data which is basically a long list (a few million values) of integers, each saying how well that position in the genome is covered. Here is a graphical example for one data set:

[coverage plot]

I would like to look for "valleys" in this data, that is, regions which are significantly lower than their surrounding environment.

Note that the size of the valleys I'm looking for is not really known - it may range from 50 bases to a few thousand. Defining what counts as a valley is of course one of the questions I'm struggling with, but the following examples are relatively easy for me:

[coverage plots]

What kind of paradigms would you recommend using to find those valleys? I mainly program using Perl and R.

Thanks!


We do peak detection (and valley detection) using running medians and median absolute deviation. You can specify how much deviation from the running median means a peak.

As a next step, we use a binomial model to check which regions contain more "extreme" values than can be expected. This model (basically a score test) results in "peak regions" instead of single peaks. Turning it around to get "valley regions" is trivial.

The running median is calculated using the function weightedMedian from the package aroma.light. We use the embed() function to make a list of "windows" and apply a kernel function on it.

The application of the weighted median:

center <- apply(embed(tmp,wdw),1,weightedMedian,w=weights,na.rm=T)

Here tmp is the temporary data vector and wdw the window size (which has to be odd). tmp is constructed by adding (wdw-1)/2 NA values on each side of the data vector. The weights are constructed using a customized function. For the mad we use the same procedure, but on diff(data) instead of the data itself.
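For orientation before the full code below, here is a minimal, self-contained sketch of that procedure on made-up data. The coverage vector, window size, Gaussian weights and the 3*MAD cut-off are my own illustrative choices, and the MAD is computed once globally on diff(x) rather than with the windowed procedure:

require(aroma.light)

set.seed(1)
x <- round(pmax(rnorm(5000, mean = 30, sd = 5), 0))   # stand-in for coverage data
x[2000:2300] <- pmax(x[2000:2300] - 20, 0)            # artificial valley

wdw <- 201                                            # window size, must be odd
pad <- rep(NA, (wdw - 1)/2)
tmp <- c(pad, x, pad)                                 # pad with NAs on both sides

weights <- dnorm(seq(-3, 3, length.out = wdw))        # simple Gaussian kernel
weights <- weights/sum(weights)*wdw

center <- apply(embed(tmp, wdw), 1, weightedMedian, w = weights, na.rm = TRUE)
spread <- median(abs(diff(x)))*1.4826                 # MAD on diff(data), global here

valley <- x < center - 3*spread                       # points far below the running median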

Runnable sample code:

require(aroma.light)
# make.weights : function to make weights on basis of a normal distribution
# n is window size !!!!!!
make.weights <- function(n,
      type=c("gaussian","epanechnikov","biweight","triweight","cosinus")){
    type <- match.arg(type)
    x <- seq(-1,1,length.out=n)
    out <- switch(type,
          gaussian=(1/sqrt(2*pi)*exp(-0.5*(3*x)^2)),
          epanechnikov=0.75*(1-x^2),
          biweight=15/16*(1-x^2)^2,
          triweight=35/32*(1-x^2)^3,
          cosinus=pi/4*cos(x*pi/2))
    out <- out/sum(out)*n
    return(out)
}

# score.test : function to obtain a p-value based on the score test
# uses a normal approximation, but is still quite accurate when p0 is
# pretty small.
# This test is one-sided, and tests whether the observed proportion
# is bigger than the hypothesized proportion
score.test <- function(x,p0,w){
    n <- length(x)
    if(missing(w)) w<-rep(1,n)
    w <- w[!is.na(x)]
    x <- x[!is.na(x)]

    if(sum(w)!=n) w <- w/sum(w)*n

    phat <- sum(x*w)/n
    z <- (phat-p0)/sqrt(p0*(1-p0)/n)
    p <- 1-pnorm(z)
    return(p)
}

# embed.na is a modification of embed, adding NA strings
# to the beginning and end of x. window size= 2n+1
embed.na <- function(x,n){
    extra <- rep(NA,n)
    x <- c(extra,x,extra)
    out <- embed(x,2*n+1)
    return(out)
}

# running.score : function to calculate the weighted p-value for the chance of being in
# a run of peaks. This chance is based on the weighted proportion of the neighbourhood
# the null hypothesis is calculated by taking the weighted proportion
# of detected peaks in the whole dataset.
# This lessens the need for adjusting parameters and makes the
# method more automatic.
# for a correct calculation, the weights have to sum up to n

running.score <- function(sel,n=20,w,p0){
    if(missing(w)) w<- rep(1,2*n+1)
    if(missing(p0))p0 <- sum(sel,na.rm=T)/length(sel[!is.na(sel)])   # null hypothesis
    out <- apply(embed.na(sel,n),1,score.test,p0=p0,w=w)
    return(out)
}

# running.med : function to calculate the running median and the (global) mad
# for a dataset. Window size = 2n+1
running.med <- function(x,w,n,cte=1.4826){
    wdw <- 2*n+1
    if(missing(w)) w <- rep(1,wdw)

    center <- apply(embed.na(x,n),1,weightedMedian,w=w,na.rm=T)
    mad <- median(abs(x-center))*cte
    return(list(med=center,mad=mad))
}

##############################################
#
# Create series
set.seed(100)
n = 1000
series <- diffinv(rnorm(20000),lag=1)

peaks <- apply(embed.na(series,n),1,function(x) x[n+1] < quantile(x,probs=0.05,na.rm=T))

pweight <- make.weights(0.2*n+1)
p.val <- running.score(peaks,n=n/10,w=pweight)

plot(series,type="l")
points((1:length(series))[p.val<0.05],series[p.val<0.05],col="red")
points((1:length(series))[peaks],series[peaks],col="blue")

The sample code above was developed to find regions with big fluctuations rather than valleys. I adapted it a bit, but it's not optimal. On top of that, for series larger than 20000 values you need a whole lot of memory; I can't run it on my computer any more.
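If memory is the main obstacle and you can live without the kernel weights, one possible workaround (my suggestion, not part of the original method) is base R's stats::runmed(), which computes a running median without building the embed() matrix; the window size and the 3*MAD cut-off below are again arbitrary:

k       <- 201                                    # odd window size
center  <- runmed(series, k = k, endrule = "median")
spread  <- median(abs(series - center))*1.4826    # global MAD around the running median
valleys <- which(series < center - 3*spread)      # unweighted valley candidates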

Alternatively, you could work with an approximation of the numerical first and second derivatives to define valleys. In your case, this might even work better. A pragmatic way of calculating the derivatives and the minima/maxima of the first derivative:

#first derivative
f.deriv <- diff(lowess(series,f=n/length(series),delta=1)$y)
#second derivative
f.sec.deriv <- diff(f.deriv)
#minima and maxima defined by where f.sec.deriv changes sign :
minmax <- cumsum(rle(sign(f.sec.deriv))$lengths)

op <- par(mfrow=c(2,1))
plot(series,type="l")
plot(f.deriv,type="l")
points((1:length(f.deriv))[minmax],f.deriv[minmax],col="red")
par(op)
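If you want the valley bottoms of the series itself rather than the extrema of the first derivative, the same machinery can be pushed one step further: local minima sit where the smoothed first derivative crosses from negative to positive. A small follow-on sketch (my addition, reusing series and n from the sample code above):

smooth.y <- lowess(series, f = n/length(series), delta = 1)$y
d1       <- diff(smooth.y)
# negative -> positive sign changes of the first derivative = candidate valley bottoms
bottoms  <- which(diff(sign(d1)) > 0) + 1

plot(series, type = "l")
abline(v = bottoms, col = "red", lty = 2)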


You can define a valley by different criteria:

  • depth
  • width
  • volume (depth*width)

You might also have a valley inside a big mountain; do you want these too?

For example, there is a valley here (the sketch below picks it out): 1 2 3 4 1000 1000 800 800 800 1000 1000 500 200 3
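As a concrete illustration of those criteria (my sketch, not part of the original answer): treat every contiguous run below some baseline as a candidate valley and score it by depth, width and volume. With a single global baseline the flanks of the mountain show up as "valleys" too, which is exactly the ambiguity being pointed out:

x        <- c(1, 2, 3, 4, 1000, 1000, 800, 800, 800, 1000, 1000, 500, 200, 3)
baseline <- 1000                                  # assumed "mountain top" level
runs     <- rle(x < baseline)
ends     <- cumsum(runs$lengths)
starts   <- ends - runs$lengths + 1

valleys <- data.frame(start = starts[runs$values], end = ends[runs$values])
valleys$width  <- valleys$end - valleys$start + 1
valleys$depth  <- baseline - mapply(function(s, e) min(x[s:e]), valleys$start, valleys$end)
valleys$volume <- valleys$depth * valleys$width
valleys   # the 800-800-800 stretch is the shallow valley inside the big mountain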

Try to explain in more detail how YOU (or any expert in your field) would choose the valleys, given the data.

You might also want to look at the watershed algorithm.


You might want to try the peak detection function to identify the regions of interest. The desired minimum width of the valleys can be specified with the span parameter.

It might be a good idea to smooth the data first, to get rid of the noise peaks like the one in the right "valley" of the blue graph. A simple stats::filter should be enough.

The final step would be to check the depth of the found "valleys". This really depends on your requirements. As a first approximation, you can simply compare the peak value with the median level of the data.
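A hedged sketch of those three steps on the random-walk series from the first answer. findValleys() below is my own stand-in for the (unspecified) peak-detection function, with span simply meaning the width of the local window, and the 2*MAD depth cut-off is an arbitrary example:

# step 1: smooth out noise with a simple moving average
smoothed <- as.numeric(stats::filter(series, rep(1/51, 51), sides = 2))

# step 2: flag points that are the minimum of their local window of width `span`
findValleys <- function(x, span = 101) {
  half <- span %/% 2
  sapply(seq_along(x), function(i) {
    win <- x[max(1, i - half):min(length(x), i + half)]
    !is.na(x[i]) && x[i] == min(win, na.rm = TRUE)
  })
}
cand <- which(findValleys(smoothed, span = 101))

# step 3: keep only candidates that are deep enough compared to the overall level
valleys <- cand[smoothed[cand] < median(series) - 2*mad(series)]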
