开发者

Adding an exponential geom_smooth in ggplot2 / R

I am trying to produce some example graphics using ggplot2, and one of the examples I picked was the birthday problem, here using code 'borrowed' from a Revolution computing presentation at Oscon.

birthday<-function(n){
    ntests<-1000
    pop<-1:365
    anydup<-function(i){
        any(duplicated(sample(pop,n,replace=TRUE)))
        }
    sum(sapply(seq(ntests), anydup))/ntests
    }

x<-data.frame(x=rep(1:100, each=5)) 
x<-ddply(x, .(x),开发者_StackOverflow社区 function(df) {return(data.frame(x=df$x, prob=birthday(df$x)))})
birthdayplot<-ggplot(x, aes(x, prob))+
        geom_point()+geom_smooth()+
        theme_bw()+
        opts(title = "Probability that at least two people share a birthday in a random group")+
        labs(x="Size of Group", y="Probability")

Here my graph is what I would describe as exponential, but the geom_smooth doesn't fit the data particularly well. I've tried the loess method but this didn't change things much. Can anyone suggest how to add a better smooth ?

Thanks

Paul.


The smoothing routine does not react to the sudden change for low values of x fast enough (and it has no way of knowing that the values of prob are restricted to a 0-1 range). Since you have so low variability, a quick solution is to reduce the span of values over which smoothing at each point is done. Check out the red line in this plot:

birthdayplot + geom_smooth(span=0.1, colour="red")


The problem is that the probabilities follow a logistic curve. You could fit a proper smoothing line if you change the birthday function to return the raw successes and failures instead of the probabilities.

birthday<-function(n){
  ntests<-1000
  pop<-1:365
  anydup<-function(i){
    any(duplicated(sample(pop,n,replace=TRUE)))
  }
  data.frame(Dups = sapply(seq(ntests), anydup) * 1, n = n)
}
x<-ddply(x, .(x),function(df) birthday(df$x))

Now, you'll have to add the points as a summary, and specify a logistic regression as the smoothing type.

ggplot(x, aes(n, Dups)) +
  stat_summary(fun.y = mean, geom = "point") +
  stat_smooth(method = "glm", family = binomial)
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜