开发者

fill gaps in a timeseries with averages

I have a dataframe like so:

day         sum_flux  samples mean
2005-10-26     0.02     48    0.02
2005-10-27     0.12     12    0.50

It's a se开发者_如何学Cries of daily readings spanning 5 years, however some of the days are missing. I want to fill these days with the average of that month from other years.

i.e if 26-10-2005 was missing I'd want to use the average of all Octobers in the data set. if all of October was missing I'd want to apply this average to each missing day.

I think I need to build a function (possibly using plyr) to evaluate the days. However I'm very inexperienced with using the various timeseries objects in R, and conditionally subsetting data and would like some advice. Especially regarding which type of timeseries I should be using.

Many Thanks


Some sample data. I'm assuming that sum_flux is the column that has missing values, and that you want to calculate values for.

library(lubridate)
days <- seq.POSIXt(ymd("2005-10-26"), ymd("2010-10-26"), by = "1 day")
n_days <- length(days)
readings <- data.frame(
  day      = days,
  sum_flux = runif(n_days),
  samples  = sample(100, n_days, replace = TRUE),
  mean     = runif(n_days)
)
readings$sum_flux[sample(n_days, floor(n_days / 10))] <- NA

Add a month column.

readings$month <- month(readings$day, label = TRUE)

Use tapply to get the monthly mean flux.

monthly_avg_flux <- with(readings, tapply(sum_flux, month, mean, na.rm = TRUE))

Use this value whenever the flux is missing, or keep the flux if not.

readings$sum_flux2 <- with(readings, ifelse(
  is.na(sum_flux), 
  monthly_avg_flux[month], 
  sum_flux
))


This is one (very fast) way in data.table.

Using the nice example data from Richie :

require(data.table)
days <- seq(as.IDate("2005-10-26"), as.IDate("2010-10-26"), by = "1 day")
n_days <- length(days)
readings <- data.table(
    day      = days,
    sum_flux = runif(n_days),
    samples  = sample(100, n_days, replace = TRUE),
    mean     = runif(n_days)
)
readings$sum_flux[sample(n_days, floor(n_days / 10))] <- NA
readings
             day   sum_flux samples       mean
 [1,] 2005-10-26 0.32838686      94 0.09647325
 [2,] 2005-10-27 0.14686591      88 0.48728321
 [3,] 2005-10-28 0.25800913      51 0.72776002
 [4,] 2005-10-29 0.09628937      81 0.80954124
 [5,] 2005-10-30 0.70721591      23 0.60165240
 [6,] 2005-10-31 0.59555079       2 0.96849533
 [7,] 2005-11-01         NA      42 0.37566491
 [8,] 2005-11-02 0.01649860      89 0.48866220
 [9,] 2005-11-03 0.46802818      49 0.28920807
[10,] 2005-11-04 0.13024856      30 0.29051080
First 10 rows of 1827 printed.

Create the average for each month, in appearance order of each group :

> avg = readings[,mean(sum_flux,na.rm=TRUE),by=list(mnth = month(day))]
> avg
      mnth        V1
 [1,]   10 0.4915999
 [2,]   11 0.5107873
 [3,]   12 0.4451787
 [4,]    1 0.4966040
 [5,]    2 0.4972244
 [6,]    3 0.4952821
 [7,]    4 0.5106539
 [8,]    5 0.4717122
 [9,]    6 0.5110490
[10,]    7 0.4507383
[11,]    8 0.4680827
[12,]    9 0.5150618

Next reorder avg to start in January :

avg = avg[order(mnth)]
avg
      mnth        V1
 [1,]    1 0.4966040
 [2,]    2 0.4972244
 [3,]    3 0.4952821
 [4,]    4 0.5106539
 [5,]    5 0.4717122
 [6,]    6 0.5110490
 [7,]    7 0.4507383
 [8,]    8 0.4680827
 [9,]    9 0.5150618
[10,]   10 0.4915999
[11,]   11 0.5107873
[12,]   12 0.4451787

Now update by reference (:=) the sum_flux column, where sum_flux is NA, with the value from avg for that month.

readings[is.na(sum_flux), sum_flux:=avg$V1[month(day)]]
             day   sum_flux samples       mean
 [1,] 2005-10-26 0.32838686      94 0.09647325
 [2,] 2005-10-27 0.14686591      88 0.48728321
 [3,] 2005-10-28 0.25800913      51 0.72776002
 [4,] 2005-10-29 0.09628937      81 0.80954124
 [5,] 2005-10-30 0.70721591      23 0.60165240
 [6,] 2005-10-31 0.59555079       2 0.96849533
 [7,] 2005-11-01 0.51078729**    42 0.37566491  # ** updated with the Nov avg
 [8,] 2005-11-02 0.01649860      89 0.48866220
 [9,] 2005-11-03 0.46802818      49 0.28920807
[10,] 2005-11-04 0.13024856      30 0.29051080
First 10 rows of 1827 printed. 

Done.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜