
missing values - Hot Deck neighbour method

I'm having a problem in R with missing values. I don't know how to impute them using a simple hot deck method. For example, given this data:

1  10000123  111  112820 0.24457235         NA        NA         NA         NA     11
2  10000132  111 2502357 0.19408587 0.19373610 0.6567305 0.01454520 0.13498823     69
3  10000388  111 4472360 0.14774927 0.14918678 0.6853377 0.05233508 0.11314044    106
4  10000792  111  666909 0.10520063         NA        NA         NA         NA     14
5  10002737  111 1139613 0.19944986 0.20114918 0.3564355 0.20135391 0.24106136     23
6  10002741  111  981574 0.11573570         NA        NA         NA         NA     13
7  10002929  111 1417192 0.08770932 0.08387991 0.6106012 0.11078473 0.19473415     24
8  10003396  111  444966 0.19026263 0.18784110 0.5215772 0.16844381 0.12213789     24
9  10003517  111 1230589 0.16393216 0.16358568 0.4614005 0.26670712 0.10830670     19
10 10003546  111  760847 0.12384748         NA        NA         NA         NA     10

Using the 5th column, I need to find the nearest value for each row that has NAs, and then fill in the NA values from that most similar respondent.

Thank you.


I've never used hot (or cold for that matter) deck sampling. However a little Googling led me to the rrp.impute function in the rrp package.

Here's a simple example using some synthetic data:

install.packages("rrp")
require(rrp)
set.seed(1)
key <- 1:100
## create random values
value1 <- 10 + 2 * key + rnorm(100, 0, 10)
## make 5 values into NAs
missing <- sample( key, 5)
value1[missing] <- NA
## build a dataframe
df <- data.frame(key, value1)
## do a nearest neighbor hot deck imputation
imputed <- rrp.impute( df )$new.data

## let's visualize this magic
plot( df)
points(missing, imputed$value1[missing], col="red")

This uses the default value of k=1, which is what I think you want. The pretty picture at the end looks like this:

[Image: scatter plot of df with the imputed values overlaid as red circles]

The red circles are the imputed values and you can see they are simply the nearest neighbor.


I don't know if there is a ready-made R package, but this does the trick:

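## abbreviated copy of the question's data: c2 is the matching column, c3 has the NAs
dfr <- data.frame(c1 = c(123, 132, 388, 792, 2737, 2741, 2929, 3396, 3517, 3546),
                  c2 = c(0.244, 0.194, 0.147, 0.105, 0.199, 0.115, 0.087, 0.190, 0.163, 0.123),
                  c3 = c(NA, 0.193, 0.149, NA, 0.201, NA, 0.083, 0.187, 0.163, NA))

## donors (hd): rows where c3 is observed; recipients (md): rows where it is missing
hdidx <- which(!is.na(dfr[, 3]))
hd <- dfr[hdidx, ]
md <- dfr[-hdidx, ]
## for each recipient, position in hd of the donor with the closest c2 value
closesthd <- sapply(md[, 2], function(curval) which.min(abs(curval - hd[, 2])))
md[, 3] <- hd[closesthd, 3]
## write the imputed values back into the full data frame
dfr[-hdidx, 3] <- md[, 3]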

Adjust the column numbers for your data, and consider a different distance measure if the absolute difference is not appropriate.
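
The question's data has several NA columns that should all be filled from the same donor row, so the same idea can be wrapped in a small function. This is only a sketch, not an established API: the name hot_deck_nn and its arguments are made up for illustration. It assumes the matching column has no NAs, and that a row missing any of the imputed columns should take all of them from its donor.

```r
# Hypothetical helper: nearest-neighbour hot deck on a single matching column.
hot_deck_nn <- function(data, match_col, impute_cols) {
  # Donors are rows with no NA in any of the columns to be imputed.
  is_donor <- stats::complete.cases(data[, impute_cols, drop = FALSE])
  donors <- which(is_donor)
  recipients <- which(!is_donor)
  if (length(donors) == 0 || length(recipients) == 0) return(data)
  # For each recipient, pick the donor row whose match_col value is closest
  # (assumes match_col itself contains no NA).
  nearest <- vapply(recipients, function(i) {
    donors[which.min(abs(data[donors, match_col] - data[i, match_col]))]
  }, integer(1))
  # Copy every imputed column from the chosen donor row.
  data[recipients, impute_cols] <- data[nearest, impute_cols]
  data
}

# Toy data (made up): x is the matching column, a and b share the same NAs.
dfr <- data.frame(
  id = 1:6,
  x  = c(0.24, 0.19, 0.15, 0.11, 0.20, 0.12),
  a  = c(NA, 0.19, 0.15, NA, 0.20, NA),
  b  = c(NA, 0.66, 0.69, NA, 0.36, NA)
)
filled <- hot_deck_nn(dfr, "x", c("a", "b"))
```

Ties are resolved by which.min, which returns the first closest donor. If ties matter, a random choice among equally close donors may be preferable.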
