Why do values not appear in ecdf plot?
I am trying to plot the CCDF of the data given below, but for some reason it doesn't look right. I cross-checked a few data points (2523, 313, 224), but they are not visible in the plot. Am I doing something wrong?
R Script:
# Y defined below
Y.ecdf = ecdf(Y)
curve((length((Y))*(1-Y.ecdf(x))), n = 10000,
from = 0, to = 100, xlab = "# of items",
ylab = "# instances", col=colors[1], lty=1, lwd=4)
Y = c( 3, 1, 4, 11, 2, 2, 9, 7, 22, 3, 1, 1, 7, 2, 2, 2, 4, 2, 1, 1, 6, 3, 20,
15, 4, 1, 1, 5, 3, 10, 16, 224, 74, 2, 1, 2, 2, 3, 3, 7, 2, 2, 1, 4, 2, 9,
3, 3, 2, 1, 1, 3, 2, 4, 4, 1, 7, 2, 1, 2, 1, 1, 2, 4, 3, 1, 1, 1, 3, 4, 2,
2, 1, 1, 5, 6, 13, 15, 3, 1, 2, 5, 1, 1, 1, 1, 2, 6, 1, 4, 1, 3, 1, 1, 4,
2, 2, 3, 3, 1, 4, 2, 1, 4, 6, 1, 1, 1, 1, 2, 5, 2, 1, 1, 1, 1, 1, 3, 1, 3,
2, 1, 1, 1, 2, 1, 8, 2, 3, 1, 1, 1, 1, 1, 3, 1, 3, 2, 1, 2, 1, 1, 5, 1, 1,
4, 3, 3, 1, 1, 1, 3, 4, 4, 3, 2, 2, 4, 3, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3,
2, 3, 9, 3, 4, 2, 1, 1, 1, 3, 22, 5, 13, 1, 1, 1, 1, 1, 4, 1, 1, 31, 1, 1,
2, 1, 1, 1, 3, 4, 4, 8, 6, 6, 7, 2, 1, 2, 2, 5, 1, 2, 6, 6, 1, 3, 1, 5, 2,
1, 5, 3, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 4, 1, 3, 2, 1, 4, 1, 212, 2,
7, 7, 10, 2, 4, 2, 1, 1, 1, 2, 3, 2, 1, 16, 6, 2, 10, 2, 1, 1, 15, 1, 3, 8,
1, 1, 3, 1, 1, 2, 1, 1, 4, 2, 3, 1, 1, 1, 1, 5, 9, 4, 1, 1, 2, 5, 1, 4, 9,
6, 19, 1, 1, 1, 2, 10, 6, 9, 5, 11, 6, 8, 1, 1, 1, 1, 1, 313, 3, 1, 3, 1,
2, 2, 2, 3, 4, 5, 1, 1, 3, 1, 1, 5, 4, 2, 5, 1, 20, 4, 1, 2, 1, 1, 1, 2, 5,
4, 2, 3, 1, 3, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 1, 1, 3, 3, 1, 1, 1, 8, 1, 1, 1, 1,
1, 1, 2, 2, 2, 2, 4, 13, 1, 2, 1, 2, 3, 3, 1, 2, 2, 1, 3, 4, 1, 1, 1, 1, 2,
2, 4, 5, 3, 2, 2, 2, 1, 1, 3, 2523, 7, 4, 2, 4, 11, 8, 1, 4, 4, 2, 5, 3, 3,
1, 3, 1, 3, 4, 1, 1, 1, 1, 6, 6, 2, 2, 1, 8, 8, 3, 3, 4, 5, 2, 2, 2, 3, 2,
6, 2, 2, 2, 1, 5, 5, 4, 3, 1, 2, 2, 6, 3, 2, 2, 2, 10, 9, 1, 2, 1, 1, 1, 2,
2, 3, 1, 3, 1, 9, 1, 1, 1, 2, 1, 96, 2, 2, 5, 1, 1, 1, 2, 2, 1, 1, 1, 5, 2,
1, 1, 1, 2, 1, 1, 4, 2, 10, 3, 2, 2, 8, 8, 2, 1, 2, 4, 1, 1, 13, 20, 3, 2,
5, 9, 1, 22, 25, 4, 1, 1, 3, 2, 1, 1, 7, 9, 5, 9, 1, 3, 1, 8, 2, 2, 1, 3,
1, 2, 6, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 16, 3, 5, 2)
Expanding on our discussion in the comments...
An empirical cumulative distribution function plots x (x axis) against Pr(X <= x) (y axis), i.e. the proportion of observations less than or equal to x. So for your example it would look something like this:
plot(Y.ecdf, do.points = FALSE,
verticals = TRUE, col = "blue",
xlab = "x", ylab = "Pr(X <= x)")
If you look very closely you can see where the line goes up when you reach your very large values, but it's hard to make out since so many of your values are less than 10.
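To confirm that the large values really are part of the ECDF even when the plot hides them, you can inspect the function numerically. This is a minimal sketch on a small made-up vector (`y_demo` is a hypothetical stand-in, not your `Y`); the same calls should work on `Y.ecdf`.

```r
# Hypothetical stand-in for Y: mostly small values plus three large outliers
y_demo <- c(1, 1, 2, 3, 5, 224, 313, 2523)
f_demo <- ecdf(y_demo)

# knots() returns the sorted distinct values at which the ECDF jumps
jump_points <- knots(f_demo)
print(tail(jump_points, 3))

# ECDF value at each outlier: the proportion of observations <= that value
outlier_probs <- f_demo(c(224, 313, 2523))
print(outlier_probs)
```

If the three largest knots are your outliers, the function contains them and only the plotting range or scale is hiding them.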
What you've done is take the complement of this function, so that you're looking at the upper tail of the distribution, i.e. Pr(X > x). You've also scaled the probabilities on the y axis by length(Y), which turns them into counts of observations greater than x. That might make sense given your particular task. So you're doing something like this (except that yours also has the y axis scaling):
curve((1-Y.ecdf(x)), n = 10000,
from = 0, to = 2600, ylab = "Pr(X > x)",
xlab = "x", col="blue", lty=1, lwd=2)
but you originally had the from and to arguments set to plot the function only from 0 to 100. If you wanted to "zoom in" on your outliers, you could just change the from and to values to something more relevant:
# from = 200 so that the step at 224 is included along with 313 and 2523
curve((1-Y.ecdf(x)), n = 10000,
from = 200, to = 2600, ylab = "Pr(X > x)",
xlab = "x", col="blue", lty=1, lwd=2)
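An alternative to zooming is to keep the full range and put the x axis on a log scale, so that the small values and the outliers are visible in the same plot. The sketch below is only an illustration: `y_demo` and `count_above` are hypothetical names, and the data is a small made-up vector rather than your `Y`. It also cross-checks the scaled complement against a direct count.

```r
# Hypothetical stand-in for Y: mostly small values plus three large outliers
y_demo <- c(1, 1, 2, 3, 5, 224, 313, 2523)
f_demo <- ecdf(y_demo)
n_obs <- length(y_demo)

# Number of observations strictly greater than x (the scaled complement)
count_above <- function(x) n_obs * (1 - f_demo(x))

# A log-scaled x axis keeps both the small values and the outliers visible;
# 'from' must be positive when log = "x"
res <- curve(count_above(x), from = 1, to = 3000, n = 1000, log = "x",
             type = "s", xlab = "x", ylab = "# observations > x",
             col = "blue", lwd = 2)

# Cross-check the curve against a direct count at a few points
check_points <- c(100, 224, 313, 2523)
direct_counts <- sapply(check_points, function(x) sum(y_demo > x))
print(cbind(x = check_points,
            from_ecdf = count_above(check_points),
            direct = direct_counts))
```

With your own data you would swap in `Y` and `Y.ecdf`; the two count columns should agree, which is a quick way to cross-check specific points without reading them off the plot.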