开发者

John Tukey "median median" (or "resistant line") statistical test for R and linear regression

I'm searching the John Tukey algorithm which compute a "resistant line" or "median-median line" on my linear regression with R.

A student on a mailling list explain this algorithm in these terms :

"The way it's calculated is to divide the data into three groups, find the x-median and y-median values (called the summary point) for each group, and then use those three summary points to determine the line. The outer two summary points determine the slope, and an average of all of them determines the intercept."

Article about John tukey's median 开发者_运维百科median for curious : http://www.johndcook.com/blog/2009/06/23/tukey-median-ninther/

Do you have an idea of where i could find this algorithm or R function ? In which packages, Thanks a lot !


There's a description of how to calculate the median-median line here. An R implementation of that is

median_median_line <- function(x, y, data)
{
  if(!missing(data))
  {
    x <- eval(substitute(x), data) 
    y <- eval(substitute(y), data) 
  }
  
  stopifnot(length(x) == length(y))

  #Step 1
  one_third_length <- floor(length(x) / 3)
  groups <- rep(1:3, times = switch((length(x) %% 3) + 1,
     one_third_length,
     c(one_third_length, one_third_length + 1, one_third_length),
     c(one_third_length + 1, one_third_length, one_third_length + 1)
  ))

  #Step 2
  x <- sort(x)
  y <- sort(y)
  
  #Step 3
  median_x <- tapply(x, groups, median)                                 
  median_y <- tapply(y, groups, median)

  #Step 4
  slope <- (median_y[3] - median_y[1]) / (median_x[3] - median_x[1])
  intercept <- median_y[1] - slope * median_x[1]

  #Step 5
  middle_prediction <- intercept + slope * median_x[2]
  intercept <- intercept + (median_y[2] - middle_prediction) / 3
  c(intercept = unname(intercept), slope = unname(slope))
}

To test it, here's an example:

dfr <- data.frame(
  time = c(.16, .24, .25, .30, .30, .32, .36, .36, .50, .50, .57, .61, .61, .68, .72, .72, .83, .88, .89),
  distance = c(12.1, 29.8, 32.7, 42.8, 44.2, 55.8, 63.5, 65.1, 124.6, 129.7, 150.2, 182.2, 189.4, 220.4, 250.4, 261.0, 334.5, 375.5, 399.1))
  
median_median_line(time, distance, dfr) 
#intercept     slope 
#   -113.6     520.0

Note the slightly odd way of specifying the groups. The instructions are quite picky about how you define group sizes, so the more obvious method of cut(x, quantile(x, seq.int(0, 1, 1/3))) doesn't work.


I'm a little late to the party, but have you tried line() from the stats package?

From the helpfile:

Value

An object of class "tukeyline".

References

Tukey, J. W. (1977). Exploratory Data Analysis, Reading Massachusetts: Addison-Wesley.


As member of the R Core team, I now have digged in the source code, and also studied the history of it.

Conclusion: The source C source code, added in 19961997, when R was still called alpha (and around version 0.14alpha) already computed the quantiles not quite correctly... for some sample sizes.

More about this on the R mailing lists (not yet).

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜