Median of grouped data

I have a dataset containing the number of infants born per gestational week.

I am trying to determine the median gestational age (GA) at delivery from the frequency of infants born in each week of this particular year.

For example:

GA num_infants_born
20 weeks 16
21 weeks 22
22 weeks 34
23 weeks 45
24 weeks 60
25 weeks 67
26 weeks 94

and so on, up to 41 weeks. The distribution is (not surprisingly) left-skewed.

I also calculated cumulative frequencies using

data$cumulative_freq = cumsum(data$num_infants_born) 

Do I use the cumulative_freq column to calculate the median number of infants born that corresponds to a gestational week? Using

median(medianGA2001a$cumulative_freq)

gives me an unexpected number.

I expect the median GA to be around 35 weeks, based on the distribution.


If I understood your question correctly, you want to expand each week into one value per infant born in that week and take the median of the result:

# Your gestational data:
gestational_data <- data.frame(GA_weeks = c(20:26),
                               num_infants_born = c(16,22,34,45,60,67,94))

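# apply() with MARGIN = 1 walks the rows; see ?apply for details.
# Each row becomes its week repeated once per infant born that week.
# Note: the native pipe |> needs R 4.1 or later.
apply(gestational_data,
      1,
      function(x) {
        rep(x[1], x[2])
      }) |>
  unlist() |>  # flatten the per-row vectors into a single vector
  median()     # median of the expanded data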
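
The same expansion can also be written without apply(), because rep() accepts a vector of counts through its times argument. A minimal sketch, recreating the gestational_data frame from above so that it runs on its own (expanded_weeks is just an illustrative name):

```r
# Same seven weeks as in the example above.
gestational_data <- data.frame(GA_weeks = 20:26,
                               num_infants_born = c(16, 22, 34, 45, 60, 67, 94))

# Repeat each week once per infant born in that week, then take the median.
expanded_weeks <- rep(gestational_data$GA_weeks,
                      times = gestational_data$num_infants_born)

length(expanded_weeks)  # 338, the total number of infants
median(expanded_weeks)  # 24
```

With only weeks 20 to 26 included the median is 24; on the full table up to 41 weeks the same two calls apply unchanged.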


What you want is a weighted median. First you need the weeks as numeric values; if they are stored as strings such as "20 weeks", strip the non-digit characters with gsub:

dat$GA_num <- as.numeric(gsub('\\D', '', dat$GA))

Then, use weightedMedian from the matrixStats package with the number of infants as weights.

matrixStats::weightedMedian(dat$GA_num, w=dat$num_infants_born)
# [1] 24.34646

Note that there are several definitions of the weighted median. For a comprehensive discussion, see this answer.


Data:

dat <- structure(list(GA = c("20 weeks", "21 weeks", "22 weeks", "23 weeks", 
"24 weeks", "25 weeks", "26 weeks"), num_infants_born = c(16L, 
22L, 34L, 45L, 60L, 67L, 94L)), class = "data.frame", row.names = c(NA, 
-7L))
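
To connect this back to the cumulative_freq column from the question: median(cumulative_freq) returns the median of the running totals themselves, which is a count of infants and not a gestational week. One simple definition of the weighted median (an assumption here, and not the interpolated one that produced 24.34646 above) is the first week at which the cumulative count reaches half of all births. A rough sketch, with the data frame recreated in numeric form so the snippet is self-contained (half_total and median_week are illustrative names):

```r
dat <- data.frame(GA_num = 20:26,
                  num_infants_born = c(16L, 22L, 34L, 45L, 60L, 67L, 94L))

dat$cumulative_freq <- cumsum(dat$num_infants_born)

# Median of the running totals: a number of infants, not a week.
median(dat$cumulative_freq)  # 117

# First week whose cumulative count reaches half of all births.
half_total <- sum(dat$num_infants_born) / 2  # 169
median_week <- dat$GA_num[which(dat$cumulative_freq >= half_total)[1]]
median_week  # 24
```

For these seven weeks this agrees with the median of the expanded data (24), while the interpolated weighted median lands between weeks 24 and 25.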