missing values - Hot Deck neighbour method
I'm having a problem with R code, rather, with missing values. Don't know actually, how to impute those values using simple Hot Deck method. Like, example, having these data.
1 10000123 111 112820 0.24457235 NA NA NA NA 11
2 10000132 111 2502357 0.19408587 0.19373610 0.6567305 0.01454520 0.13498823 69
3 10000388 111 4472360 0.14774927 0.14918678 0.6853377 0.05233508 0.11314044 106
4 10000792 111 666909 0.10520063 NA NA NA NA 14
5 10002737 111 1139613 0.19944986 0.20114918 0.3564355 0.20135391 0.24106136 23
6 10002741 111 981574 0.11573570 NA NA NA NA 13
7 10002929 111 1417192 0.08770932 0.08387991 0.6106012 0.11078473 开发者_JAVA百科0.19473415 24
8 10003396 111 444966 0.19026263 0.18784110 0.5215772 0.16844381 0.12213789 24
9 10003517 111 1230589 0.16393216 0.16358568 0.4614005 0.26670712 0.10830670 19
10 10003546 111 760847 0.12384748 NA NA NA NA 10
Using 5th column, need to find the nearest value, and then fill with that similar respondent in those places, where are NA values.
Thank You.
I've never used hot (or cold for that matter) deck sampling. However a little Googling led me to the rrp.impute
function in the rrp package.
Here's a simple example using some synthetic data:
install.packages("rrp")
require(rrp)
set.seed(1)
key <- 1:100
## create random values
value1 <- 10 + 2 * key + rnorm(100, 0, 10)
## make 5 values into NAs
missing <- sample( key, 5)
value1[missing] <- NA
## build a dataframe
df <- data.frame(key, value1)
## do a nearest neighbor hot deck interpolation
imputed <- rrp.impute( df )$new.data
## let's visualize this magic
plot( df)
points(missing, imputed$value1[missing], col="red")
This uses the default value of k=1, which is what I think you want. The pretty picture at the end looks like this:
The red circles are the imputed values and you can see they are simply the nearest neighbor.
I don't know if there is a ready-made R package, but this does the trick:
dfr<-data.frame(c1=c(123,132,388,792,2737,2741,2929,3396,3517,3546),
c2=c(0.244,0.194,0.47,0.105,0.199,0.115,0.087,0.190,0.163,0.123),
c3=c(NA, 0.193,0.149, NA, 0.201, NA, 0.083,0.187,0.163,NA))
hdidx<-which(!is.na(dfr[,3]))
hd<-dfr[hdidx,]
md<-dfr[-hdidx,]
closesthd<-sapply(md[,2], function(curval){which.min(abs(curval-hd[,2]))})
md[,3]<-hd[closesthd,3]
Replace column numbers where needed for your case + maybe take another distance measure.
精彩评论