Advice on calculating a function to describe upper bound of data

2023-02-02 03:59 问答作者：

I ha开发者_开发百科ve a scatter plot of a dataset and I am interested in calculating the upper bound of the data. I don't know if this is a standard statistical approach so what I was considering doing was splitting the X-axis data into small ranges, calculating the max for these ranges and then trying to identify a function to describe these points. Is there a function already in R to do this?

If it's relevant there are 92611 points.

You might like to look into quantile regression, which is available in the quantreg package. Whether this is useful will depend on whether you want the absolute maximum within your "windows" are whether some extreme quantile, say 95th or 99th, is acceptable? If you are not familiar with quantile regression, then consider the linear regression which fits a model for the expectation or mean response, conditional upon the model covariates. Quantile regression for the middle quantile (0.5) would fit a model to the median response, conditional upon the model covariates.

Here is an example using the quantreg package, to show you what I mean. First, generate some dummy data similar to the data you show:

set.seed(1)
N <- 5000
DF <- data.frame(Y = rev(sort(rlnorm(N, -0.9))) + rnorm(N),
                 X = seq_len(N))
plot(Y ~ X, data = DF)

Next, fit the model to the 99th percentile (or the 0.99 quantile):

mod <- rq(Y ~ log(X), data = DF, tau = .99)

To generate the "fitted line", we predict from the model at 100 equally spaced values in X

pDF <- data.frame(X = seq(1, 5000, length = 100))
pDF <- within(pDF, Y <- predict(mod, newdata = pDF))

and add the fitted model to the plot:

lines(Y ~ X, data = pDF, col = "red", lwd = 2)

This should give you this:

Advice on calculating a function to describe upper bound of data

I would second Gavin's nomination for using quantile regression. Your data might be simulated with your X and Y each log-normally distributed. You can see what a plot of the joint distribution of two independent (no imposed correlation, but not necessarily cor(x,y)==0) log-normal variates looks like if you run:

x <- rlnorm(1000, log(300), sdlog=1)
y<- rlnorm(1000, log(7), sdlog=1)
plot(x,y, cex=0.3)

Advice on calculating a function to describe upper bound of data

You might consider looking at their individual distributions with qqplot (in the base plotting functions) remembering that the tails of such distrubutions can behave in surprising manner. You should be more interested in how well the bulk of the values fit a particular distribution than the extremes ... unless of course your applications are in finance or insurance. Don't want another global financial crisis because of poor modeling assumptions about tail behavior, now do we?

qqplot(x, rlnorm(10000, log(300), sdlog=1) )

继续阅读：r statistics

Advice on calculating a function to describe upper bound of data

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？