Generate sets for cross-validation

How can I automatically split a matrix in R for 5-fold cross-validation? I actually want to generate the 5 sets of (test_matrix_indices, train_matrix_indices).


I suppose you want the matrix rows to be the cases to split. Then all you need is sample and split :

X <- matrix(rnorm(1000),ncol=5)
id <- sample(1:5,nrow(X),replace=TRUE)
# split the row numbers by fold, then subset: gives you a list with the 5 matrices
ListX <- lapply(split(seq_len(nrow(X)),id), function(i) X[i,,drop=FALSE])
X[id==2,] # gives you the second matrix

I'd work with the list, as it allows you to do something like :

names(ListX) <- c("Train1","Train2","Train3","Test1","Test2")
mean(ListX$Train3)

which makes for code that's easier to read, and keeps you from creating tons of matrices in your workspace. You're bound to mess up if you put the matrices individually in your workspace. Use lists!

In case you want the test matrix to be smaller or larger than the other ones, use the prob argument of sample :

id <- sample(1:5,nrow(X),replace=TRUE,prob=c(0.15,0.15,0.15,0.15,0.3))

gives you a test matrix that is, on average, double the size of the train matrices. The weights in prob need not sum to 1.

In case you want to determine the exact number of cases, sample and prob aren't the best options. You could use a trick like :

indices <- rep(1:5,c(100,20,20,20,40))
id <- sample(indices)

to get matrices with respectively 100, 20, ... and 40 cases.
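A quick check of that trick, as a sketch: because `indices` contains each fold label a fixed number of times and `sample()` only permutes it, the fold sizes come out exactly as requested. The seed and the `sizes` / `fold_sizes` names are assumptions made for illustration.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility only
X <- matrix(rnorm(1000), ncol = 5)           # 200 rows, as in the example above
sizes <- c(100, 20, 20, 20, 40)              # requested fold sizes
stopifnot(sum(sizes) == nrow(X))             # the sizes must add up to the number of rows

id <- sample(rep(seq_along(sizes), sizes))   # shuffle the fixed set of fold labels
fold_sizes <- as.vector(table(id))           # observed number of rows per fold
fold_sizes                                   # equals sizes, whatever the seed
```

If the sizes do not sum to nrow(X), id and the matrix rows no longer line up, which is why the check is worth keeping.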


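f_K_fold <- function(Nobs,K=5){
    # random permutation of 1..Nobs
    rs <- runif(Nobs)
    id <- seq(Nobs)[order(rs)]
    # row x of k holds the first and last position of fold x within id
    k <- as.integer(Nobs*seq(1,K-1)/K)
    k <- matrix(c(0,rep(k,each=2),Nobs),ncol=2,byrow=TRUE)
    k[,1] <- k[,1]+1
    # fold x is the test set, everything else is the training set
    l <- lapply(seq.int(K),function(x,k,d) 
                list(train=d[!(seq_along(d) %in% seq(k[x,1],k[x,2]))],
                     test=d[seq(k[x,1],k[x,2])]),k=k,d=id)
    return(l)
}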


Solution without split:

set.seed(7402313)
X <- matrix(rnorm(999), ncol=3)
k <- 5 # number of folds

# Generating random indices 
id <- sample(rep(seq_len(k), length.out=nrow(X)))
table(id)
# 1  2  3  4  5 
# 67 67 67 66 66 

# lapply over them:
indices <- lapply(seq_len(k), function(a) list(
    test_matrix_indices = which(id==a),
    train_matrix_indices = which(id!=a)
))
str(indices)
# List of 5
#  $ :List of 2
#   ..$ test_matrix_indices : int [1:67] 12 13 14 17 18 20 23 28 41 45 ...
#   ..$ train_matrix_indices: int [1:266] 1 2 3 4 5 6 7 8 9 10 ...
#  $ :List of 2
#   ..$ test_matrix_indices : int [1:67] 4 19 31 36 47 53 58 67 83 89 ...
#   ..$ train_matrix_indices: int [1:266] 1 2 3 5 6 7 8 9 10 11 ...
#  $ :List of 2
#   ..$ test_matrix_indices : int [1:67] 5 8 9 30 32 35 37 56 59 60 ...
#   ..$ train_matrix_indices: int [1:266] 1 2 3 4 6 7 10 11 12 13 ...
#  $ :List of 2
#   ..$ test_matrix_indices : int [1:66] 1 2 3 6 21 24 27 29 33 34 ...
#   ..$ train_matrix_indices: int [1:267] 4 5 7 8 9 10 11 12 13 14 ...
#  $ :List of 2
#   ..$ test_matrix_indices : int [1:66] 7 10 11 15 16 22 25 26 40 42 ...
#   ..$ train_matrix_indices: int [1:267] 1 2 3 4 5 6 8 9 12 13 ...

But you could return matrices too:

matrices <- lapply(seq_len(k), function(a) list(
    test_matrix = X[id==a, ],
    train_matrix = X[id!=a, ]
))
str(matrices)
# List of 5
#  $ :List of 2
#   ..$ test_matrix : num [1:67, 1:3] -1.0132 -1.3657 -0.3495 0.6664 0.0762 ...
#   ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.484 0.682 ...
#  $ :List of 2
#   ..$ test_matrix : num [1:67, 1:3] 0.484 0.418 -0.622 0.996 0.414 ...
#   ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.682 0.186 ...
#  $ :List of 2
#   ..$ test_matrix : num [1:67, 1:3] 0.682 0.812 -1.111 -0.467 0.37 ...
#   ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.484 0.186 ...
#  $ :List of 2
#   ..$ test_matrix : num [1:66, 1:3] -0.65 0.797 0.689 0.186 -1.398 ...
#   ..$ train_matrix: num [1:267, 1:3] 0.484 0.682 0.473 0.812 -1.111 ...
#  $ :List of 2
#   ..$ test_matrix : num [1:66, 1:3] 0.473 0.212 -2.175 -0.746 1.707 ...
#   ..$ train_matrix: num [1:267, 1:3] -0.65 0.797 0.689 0.484 0.682 ...

Then you could use lapply to get results:

lapply(matrices, function(x) {
     m <- build_model(x$train_matrix)
     performance(m, x$test_matrix)
})
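build_model() and performance() above are placeholders. A minimal runnable sketch, assuming a linear model with the first column as the response and RMSE as the score (both choices are illustrative assumptions, not part of the original answer), could look like this:

```r
set.seed(42)                                   # arbitrary seed
X <- matrix(rnorm(999), ncol = 3)              # same shape as above
k <- 5
id <- sample(rep(seq_len(k), length.out = nrow(X)))

# Hypothetical stand-ins for build_model() and performance()
build_model <- function(train) {
  lm(V1 ~ ., data = as.data.frame(train))      # as.data.frame() names the columns V1, V2, V3
}
performance <- function(model, test) {
  test <- as.data.frame(test)
  sqrt(mean((test$V1 - predict(model, newdata = test))^2))   # RMSE on the held-out fold
}

rmse <- sapply(seq_len(k), function(a) {
  m <- build_model(X[id != a, , drop = FALSE])
  performance(m, X[id == a, , drop = FALSE])
})
rmse         # one RMSE per fold
mean(rmse)   # cross-validated estimate
```

Subsetting inside the loop avoids holding all ten matrices in memory at once; with data this small either approach works.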

Edit: compare to Wojciech's solution:

f_K_fold <- function(Nobs, K=5){
    id <- sample(rep(seq.int(K), length.out=Nobs))
    l <- lapply(seq.int(K), function(x) list(
         train = which(x!=id),
         test  = which(x==id)
    ))
    return(l)
}
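A sanity check one might run on a fold generator like this, sketched with the shorter version: every observation should land in exactly one test set, and train and test should never overlap. The values Nobs = 23 and the seed are arbitrary assumptions.

```r
f_K_fold <- function(Nobs, K = 5) {
  id <- sample(rep(seq.int(K), length.out = Nobs))
  lapply(seq.int(K), function(x) list(train = which(x != id), test = which(x == id)))
}

set.seed(123)                       # arbitrary seed
folds <- f_K_fold(23, K = 5)        # 23 is deliberately not a multiple of 5

# Each row index should occur exactly once across the five test sets
all_test <- sort(unlist(lapply(folds, `[[`, "test")))
covers_all <- identical(all_test, seq_len(23))

# Within a fold, train and test must be disjoint
no_overlap <- all(sapply(folds, function(f) length(intersect(f$train, f$test)) == 0))

# Test-set sizes differ by at most one row
test_sizes <- sapply(folds, function(f) length(f$test))
```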


Edit : Thanks for your answers. I have found the following solution (http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/fr_Tanagra_Validation_Croisee_Suite.pdf) :

n <- nrow(mydata)
K <- 5
size <- n %/% K
set.seed(5)
rdm <- runif(n)
ranked <- rank(rdm)
block <- (ranked-1) %/% size+1
block <- as.factor(block)

Then I use :

for (k in 1:K) {
    matrix_train <- mydata[block!=k,]
    matrix_test <- mydata[block==k,]
    # [Algorithm sequence]
}

to generate the appropriate sets for each iteration.

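However, when n is not a multiple of K, this solution leaves up to K-1 individuals out of every test set: they end up in an extra block K+1 that the loop never visits. I do not recommend it.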
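The problem is easy to reproduce. The sketch below uses n = 11 (an arbitrary value chosen for illustration) and then shows one possible alternative labelling that keeps every row in one of the K blocks:

```r
n <- 11                                 # hypothetical sample size, not a multiple of K
K <- 5
size <- n %/% K                         # 2
set.seed(5)
ranked <- rank(runif(n))
block <- (ranked - 1) %/% size + 1      # ranks 1..11 map to blocks 1,1,2,2,3,3,4,4,5,5,6
table(block)                            # block 6 holds one row that is never used for testing

# One possible fix: recycle the labels 1..K over all rows, then shuffle them
block_fixed <- sample(rep(seq_len(K), length.out = n))
table(block_fixed)                      # every row belongs to one of the K blocks
```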


The code below does the trick without creating separate data.frames/matrices for every fold up front. All you need to keep is an integer vector, id, that stores the shuffled fold label of each row.

X <- read.csv('data.csv')

k <- 5 # number of folds
# fold labels 1..k, recycled so that every row gets one even if nrow(X) is not a multiple of k
indices <- rep(1:k, length.out = nrow(X))
id <- sample(indices, replace = FALSE) # random draws without replacement

log_models <- new.env(hash=T, parent=emptyenv()) 
for (i in 1:k){
  train <- X[id != i,]
  test <- X[id == i,]
  # run algorithm, e.g. logistic regression
  log_models[[as.character(i)]] <- glm(outcome~., family="binomial", data=train)
}
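A sketch of how a loop like this might be extended to score each held-out fold. Since data.csv is not available here, the data are simulated; the column names x1, x2, the coefficients, and the 0.5 cutoff are all assumptions made for illustration.

```r
set.seed(2024)                                   # arbitrary seed
n <- 203                                         # deliberately not a multiple of k
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))    # simulated stand-in for data.csv
X$outcome <- rbinom(n, 1, plogis(0.8 * X$x1 - 0.5 * X$x2))

k <- 5
id <- sample(rep(seq_len(k), length.out = n))    # one fold label per row

pred <- numeric(n)                               # out-of-fold predicted probabilities
for (i in seq_len(k)) {
  fit <- glm(outcome ~ ., family = "binomial", data = X[id != i, ])
  pred[id == i] <- predict(fit, newdata = X[id == i, ], type = "response")
}

accuracy <- mean((pred > 0.5) == X$outcome)      # cross-validated accuracy at a 0.5 cutoff
accuracy
```

Each row is predicted by a model that never saw it, so the accuracy is an out-of-sample estimate.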


The sperrorest package provides this ability. You can choose between a random split (partition.cv()), a spatial split (partition.kmeans()), or a split based on factor levels (partition.factor.cv()). The latter is currently only available in the Github version.

Example:

library(sperrorest)
data(ecuador)

## non-spatial cross-validation:
resamp <- partition.cv(ecuador, nfold = 5, repetition = 1:1)

# first repetition, second fold, test set indices:
idx <- resamp[['1']][[2]]$test

# test sample used in this particular repetition and fold:
ecuador[idx , ]

If you have a spatial data set (with coords), you can also visualize your generated folds:

# this may take some time...
plot(resamp, ecuador)

Cross-validation can then be performed using sperrorest() (sequential) or parsperrorest() (parallel).
