Generate sets for cross-validation
How do I automatically split a matrix in R for 5-fold cross-validation? I actually want to generate the 5 sets of (test_matrix_indices, train_matrix_indices).
I suppose you want the matrix rows to be the cases to split. Then all you need is sample and split:
X <- matrix(rnorm(1000),ncol=5)
id <- sample(1:5,nrow(X),replace=TRUE)
ListX <- split.data.frame(X,id) # gives you a list with the 5 matrices (split.data.frame also splits a matrix by rows)
X[id==2,] # gives you the second matrix
I'd work with the list, as it allows you to do something like :
names(ListX) <- c("Train1","Train2","Train3","Test1","Test2")
mean(ListX$Train3)
which makes for code that's easier to read, and keeps you from creating tons of matrices in your workspace. You're bound to mess up if you put the matrices individually in your workspace. Use lists!
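For the cross-validation itself you then need, for every group, the rows in that group as the test set and the other four groups as the training set. A minimal sketch using the id vector above (fit_fun and err_fun are hypothetical placeholders for your own fitting and evaluation code):
cv_results <- lapply(1:5, function(i) {
    train <- X[id != i, ]    # the other 4 folds for training
    test  <- X[id == i, ]    # fold i for testing
    # m <- fit_fun(train); err_fun(m, test)
    list(n_train = nrow(train), n_test = nrow(test))
})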
In case you want the test matrix to be smaller or larger than the other ones, use the prob argument of sample:
id <- sample(1:5,nrow(X),replace=TRUE,prob=c(0.15,0.15,0.15,0.15,0.3))
gives you a test matrix that's, on average, about double the size of the train matrices.
In case you want to determine the exact number of cases, sample and prob aren't the best options. You could use a trick like:
indices <- rep(1:5,c(100,20,20,20,40))
id <- sample(indices)
to get matrices with respectively 100, 20, ... and 40 cases.
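A quick sanity check that the fold sizes come out exactly as specified:
table(id)
# id
#   1   2   3   4   5
# 100  20  20  20  40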
Another option is a function that shuffles the row indices and cuts them into K contiguous, (nearly) equal-sized blocks:
f_K_fold <- function(Nobs, K=5){
    rs <- runif(Nobs)
    id <- seq(Nobs)[order(rs)]                  # shuffled row indices
    k <- as.integer(Nobs * seq(1, K-1) / K)     # cut points between folds
    k <- matrix(c(0, rep(k, each=2), Nobs), ncol=2, byrow=TRUE)
    k[,1] <- k[,1] + 1                          # each row of k holds the start/end of one fold
    l <- lapply(seq.int(K), function(x, k, d)
        list(train = d[!(seq(d) %in% seq(k[x,1], k[x,2]))],
             test  = d[seq(k[x,1], k[x,2])]),
        k=k, d=id)
    return(l)
}
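A quick usage sketch (the exact indices depend on the random draw, but the fold sizes are deterministic):
folds <- f_K_fold(200, K = 5)     # e.g. 200 observations, as for the X above
str(folds[[1]])
# List of 2
#  $ train: int [1:160] ...
#  $ test : int [1:40] ...
sapply(folds, function(f) length(f$test))   # 40 40 40 40 40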
Solution without split:
set.seed(7402313)
X <- matrix(rnorm(999), ncol=3)
k <- 5 # number of folds
# Generating random indices
id <- sample(rep(seq_len(k), length.out=nrow(X)))
table(id)
# 1 2 3 4 5
# 67 67 67 66 66
# lapply over them:
indices <- lapply(seq_len(k), function(a) list(
    test_matrix_indices  = which(id == a),
    train_matrix_indices = which(id != a)
))
str(indices)
# List of 5
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 12 13 14 17 18 20 23 28 41 45 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 4 5 6 7 8 9 10 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 4 19 31 36 47 53 58 67 83 89 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 5 6 7 8 9 10 11 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 5 8 9 30 32 35 37 56 59 60 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 4 6 7 10 11 12 13 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:66] 1 2 3 6 21 24 27 29 33 34 ...
# ..$ train_matrix_indices: int [1:267] 4 5 7 8 9 10 11 12 13 14 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:66] 7 10 11 15 16 22 25 26 40 42 ...
# ..$ train_matrix_indices: int [1:267] 1 2 3 4 5 6 8 9 12 13 ...
But you could return matrices too:
matrices <- lapply(seq_len(k), function(a) list(
    test_matrix  = X[id == a, ],
    train_matrix = X[id != a, ]
))
str(matrices)
# List of 5
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] -1.0132 -1.3657 -0.3495 0.6664 0.0762 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.484 0.682 ...
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] 0.484 0.418 -0.622 0.996 0.414 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.682 0.186 ...
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] 0.682 0.812 -1.111 -0.467 0.37 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.484 0.186 ...
# $ :List of 2
# ..$ test_matrix : num [1:66, 1:3] -0.65 0.797 0.689 0.186 -1.398 ...
# ..$ train_matrix: num [1:267, 1:3] 0.484 0.682 0.473 0.812 -1.111 ...
# $ :List of 2
# ..$ test_matrix : num [1:66, 1:3] 0.473 0.212 -2.175 -0.746 1.707 ...
# ..$ train_matrix: num [1:267, 1:3] -0.65 0.797 0.689 0.484 0.682 ...
Then you could use lapply to get results:
lapply(matrices, function(x) {
    m <- build_model(x$train_matrix)
    performance(m, x$test_matrix)
})
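build_model() and performance() above are placeholders for your own functions. As a minimal concrete sketch, assuming (purely for illustration) that the first column is the response and the others are predictors, fitted with a linear model and scored by RMSE:
build_model <- function(train) {
    lm(V1 ~ ., data = as.data.frame(train))      # V1 = first column, treated as the response
}
performance <- function(m, test) {
    test <- as.data.frame(test)
    sqrt(mean((test$V1 - predict(m, test))^2))   # RMSE on the held-out fold
}
cv_rmse <- lapply(matrices, function(x) {
    m <- build_model(x$train_matrix)
    performance(m, x$test_matrix)
})
unlist(cv_rmse)   # one RMSE per fold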
Edit: compare to Wojciech's solution:
f_K_fold <- function(Nobs, K=5){
    id <- sample(rep(seq.int(K), length.out=Nobs))
    l <- lapply(seq.int(K), function(x) list(
        train = which(x != id),
        test  = which(x == id)
    ))
    return(l)
}
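Both versions return the same kind of object, a list of K train/test index pairs, so they can be used interchangeably downstream:
str(f_K_fold(10, K = 2))
# List of 2
#  $ :List of 2
#   ..$ train: int [1:5] ...
#   ..$ test : int [1:5] ...
#  $ :List of 2
#   ..$ train: int [1:5] ...
#   ..$ test : int [1:5] ...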
Edit: Thanks for your answers. I have found the following solution (http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/fr_Tanagra_Validation_Croisee_Suite.pdf):
n <- nrow(mydata)
K <- 5
size <- n %/% K
set.seed(5)
rdm <- runif(n)
ranked <- rank(rdm)
block <- (ranked-1) %/% size+1
block <- as.factor(block)
Then I use:
for (k in 1:K) {
    matrix_train <- mydata[block != k, ]
    matrix_test  <- mydata[block == k, ]
    # [Algorithm sequence]
}
in order to generate the appropriate sets for each iteration.
However, when n is not a multiple of K, this solution assigns some observations to a block numbered above K, so they never appear in any test set. I do not recommend it.
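A small hypothetical example shows the issue: with n = 13 and K = 5, the integer division creates blocks numbered above K, and those rows never end up in a test set:
n <- 13; K <- 5
size <- n %/% K                              # 2
block <- (rank(runif(n)) - 1) %/% size + 1
table(block)
# block
# 1 2 3 4 5 6 7
# 2 2 2 2 2 2 1
# blocks 6 and 7 are never used as test folds in the 1:K loop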
The code below does the trick without having to create separate data frames/matrices; all you need to do is keep an integer vector, id, that stores the shuffled fold assignment of each row.
X <- read.csv('data.csv')
k <- 5                         # number of folds
fold_size <- nrow(X) / k       # assumes nrow(X) is divisible by k
indices <- rep(1:k, rep(fold_size, k))
id <- sample(indices, replace = FALSE)   # random draws without replacement
log_models <- new.env(hash = TRUE, parent = emptyenv())
for (i in 1:k) {
    train <- X[id != i, ]
    test  <- X[id == i, ]
    # run algorithm, e.g. logistic regression
    log_models[[as.character(i)]] <- glm(outcome ~ ., family = "binomial", data = train)
}
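If you also want an out-of-sample performance estimate per fold, you can predict on the held-out part inside the same loop. A sketch, assuming outcome is a 0/1 column in X:
accuracy <- numeric(k)
for (i in 1:k) {
    train <- X[id != i, ]
    test  <- X[id == i, ]
    fit   <- glm(outcome ~ ., family = "binomial", data = train)
    pred  <- predict(fit, newdata = test, type = "response") > 0.5
    accuracy[i] <- mean(pred == (test$outcome == 1))   # accuracy on the held-out fold
}
mean(accuracy)   # cross-validated accuracy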
The sperrorest package provides this ability. You can choose between a random split (partition.cv()), a spatial split (partition.kmeans()), or a split based on factor levels (partition.factor.cv()). The latter is currently only available in the GitHub version.
Example:
library(sperrorest)
data(ecuador)
## non-spatial cross-validation:
resamp <- partition.cv(ecuador, nfold = 5, repetition = 1:1)
# first repetition, second fold, test set indices:
idx <- resamp[['1']][[2]]$test
# test sample used in this particular repetition and fold:
ecuador[idx, ]
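You can also drive the model fitting yourself, since each fold of the represampling object carries both training and test indices ($train and $test). A rough sketch (slides as the response and dem and slope as predictors from the ecuador data are illustrative choices; adjust the formula to your own problem):
fold_err <- sapply(resamp[['1']], function(fold) {
    fit  <- glm(slides ~ dem + slope, family = binomial,
                data = ecuador[fold$train, ])
    pred <- predict(fit, newdata = ecuador[fold$test, ], type = "response")
    mean((pred > 0.5) != (ecuador$slides[fold$test] == "TRUE"))   # misclassification rate
})
fold_err   # one error estimate per fold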
If you have a spatial data set (with coordinates), you can also visualize the generated folds:
# this may take some time...
plot(resamp, ecuador)
Cross-validation can then be performed using sperrorest() (sequential) or parsperrorest() (parallel).