Applying a function to a distance matrix in R
This question came today in the manipulatr mailing list.
http://groups.google.com/group/manipulatr/browse_thread/thread/fbab76945f7cba3f
I am rephrasing.
Given a distance matrix (calculated with dist
) apply a function to the rows of the distance matrix.
Code:
library(plyr)
N <- 100
a <- data.frame(b=1:N,c=runif(N))
d &开发者_开发知识库lt;- dist(a,diag=T,upper=T)
sumd <- adply(as.matrix(d),1,sum)
The problem is that to apply the function by row you have to store the whole matrix (instead of just the lower triangular part. So it uses too much memory for large matrices. It fails in my computer for matrices of dimensions ~ 10000.
Any ideas?
First of all, for anyone who hasn't seen this yet, I strongly recommend reading this article on the r-wiki about code optimization.
Here's another version without using ifelse
(that's a relatively slow function):
noeq.2 <- function(i, j, N) {
i <- i-1
j <- j-1
x <- i*(N-1) - (i-1)*((i-1) + 1)/2 + j - i
x2 <- j*(N-1) - (j-1)*((j-1) + 1)/2 + i - j
idx <- i < j
x[!idx] <- x2[!idx]
x[i==j] <- 0
x
}
And timings on my laptop:
> N <- 1000
> system.time(sapply(1:N, function(i) sapply(1:N, function(j) noeq(i, j, N))))
user system elapsed
51.31 0.10 52.06
> system.time(sapply(1:N, function(j) noeq.1(1:N, j, N)))
user system elapsed
2.47 0.02 2.67
> system.time(sapply(1:N, function(j) noeq.2(1:N, j, N)))
user system elapsed
0.88 0.01 1.12
And lapply is faster than sapply:
> system.time(do.call("rbind",lapply(1:N, function(j) noeq.2(1:N, j, N))))
user system elapsed
0.67 0.00 0.67
This is a vectorized version of the function noeq
(either argument i
or j
):
noeq.1 <- function(i, j, N) {
i <- i-1
j <- j-1
ifelse(i < j,
i*(N-1) - ((i-1)*i)/2 + j - i,
j*(N-1) - ((j-1)*j)/2 + i - j) * ifelse(i == j, 0, 1)
}
> N <- 4
> sapply(1:N, function(i) sapply(1:N, function(j) noeq(i, j, N)))
[,1] [,2] [,3] [,4]
[1,] 0 1 2 3
[2,] 1 0 4 5
[3,] 2 4 0 6
[4,] 3 5 6 0
> sapply(1:N, function(i) noeq.1(i, 1:N, N))
[,1] [,2] [,3] [,4]
[1,] 0 1 2 3
[2,] 1 0 4 5
[3,] 2 4 0 6
[4,] 3 5 6 0
Timings are done on a 2.4 GHz Intel Core 2 Duo (Mac OS 10.6.1):
> N <- 1000
> system.time(sapply(1:N, function(j) noeq.1(1:N, j, N)))
user system elapsed
0.676 0.061 0.738
> system.time(sapply(1:N, function(i) sapply(1:N, function(j) noeq(i, j, N))))
user system elapsed
14.359 0.032 14.410
My solution is to get the indexes of the distance vector, given a row and the size of the matrix. I got this from codeguru
int Trag_noeq(int row, int col, int N)
{
//assert(row != col); //You can add this in if you like
if (row<col)
return row*(N-1) - (row-1)*((row-1) + 1)/2 + col - row - 1;
else if (col<row)
return col*(N-1) - (col-1)*((col-1) + 1)/2 + row - col - 1;
else
return -1;
}
After translating to R, assuming indexes start at 1, and assuming a lower tri instead of upper tri matrix I got.
EDIT: Using the vectorized version contributed by rcs
noeq.1 <- function(i, j, N) {
i <- i-1
j <- j-1
ix <- ifelse(i < j,
i*(N-1) - (i-1)*((i-1) + 1)/2 + j - i,
j*(N-1) - (j-1)*((j-1) + 1)/2 + i - j) * ifelse(i == j, 0, 1)
ix
}
## To get the indexes of the row, the following one liner works:
getrow <- function(z, N) noeq.1(z, 1:N, N)
## to get the row sums
getsum <- function(d, f=sum) {
N <- attr(d, "Size")
sapply(1:N, function(i) {
if (i%%100==0) print(i)
f(d[getrow(i,N)])
})
}
So, with the example:
sumd2 <- getsum(d)
This was much slower than as.matrix for small matrices before vectorizing. But just about 3x as slow after vectorizing. In a Intel Core2Duo 2ghz applying the sum by row of the size 10000 matrix took just over 100s. The as.matrix method fails. Thanks rcs!
精彩评论