How to improve this Algorithm?
R Version 2.11.1 32-bit on Windows 7
I have the data file train.txt, shown below:
USER_A USER_B ACTION
1 7 0
1 8 1
2 6 2
2 7 1
3 8 2
And I process the data with the algorithm below:
train_data=read.table("train.txt",header=T)
result=matrix(0,length(unique(train_data$USER_B)),2)
result[,1]=unique(train_data$USER_B)
for(i in 1:dim(result)[1])
{
temp=train_data[train_data$USER_B%in%result[i,1],]
result[i,2]=sum(temp[,3])/dim(temp)[1]
}
The result is the score of every USER_B in train_data, where the score is defined as:
score of USER_B = (the sum of all ACTION values for USER_B) / (the number of times USER_B was recommended)
But train_data is very large; this program may take three days to finish, so I have come here to ask for help: can this algorithm be improved?
Running your example, your desired result is to calculate the mean ACTION for each unique USER_B:
[,1] [,2]
[1,] 7 0.5
[2,] 8 1.5
[3,] 6 2.0
You can do this with one line of code using the ddply() function from the plyr package:
library(plyr)
ddply(train_data[, -1], .(USER_B), numcolwise(mean))
USER_B ACTION
1 6 2.0
2 7 0.5
3 8 1.5
Alternatively, the function tapply in base R does the same:
tapply(train_data$ACTION, train_data$USER_B, mean)
Depending on the size of your table, you can get an improvement in execution time of 20x or higher. Here is the system.time test for a data.frame with a million entries. Your algorithm takes 116 seconds, ddply() takes 5.4 seconds, and tapply takes 1.2 seconds:
train_data <- data.frame(
USER_A = 1:1e6,
USER_B = sample(1:1e3, size=1e6, replace=TRUE),
ACTION = sample(1:100, size=1e6, replace=TRUE))
yourfunction <- function(){
result <- matrix(0,length(unique(train_data$USER_B)),2)
result[,1] <- unique(train_data$USER_B);
for(i in 1:dim(result)[1]){
temp=train_data[train_data$USER_B%in%result[i,1],]
result[i,2]=sum(temp[,3])/dim(temp)[1]
}
result
}
system.time(XX <- yourfunction())
user system elapsed
116.29 14.04 134.33
system.time(YY <- ddply(train_data[, -1], .(USER_B), numcolwise(mean)))
user system elapsed
5.43 1.60 7.19
system.time(ZZ <- tapply(train_data$ACTION, train_data$USER_B, mean))
user system elapsed
1.17 0.06 1.25
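For completeness (not timed in the answer above), base R's aggregate() computes the same per-group means and returns a data frame directly. It is generally slower than tapply on large tables, so treat this as a readability option; the small train_data below merely re-creates the question's example:

```r
# Small data frame re-creating the question's example
train_data <- data.frame(USER_A = c(1, 1, 2, 2, 3),
                         USER_B = c(7, 8, 6, 7, 8),
                         ACTION = c(0, 1, 2, 1, 2))

# aggregate() returns a data frame with one row per unique USER_B,
# sorted by USER_B
res <- aggregate(ACTION ~ USER_B, data = train_data, FUN = mean)
res
# USER_B 6 -> 2.0, 7 -> 0.5, 8 -> 1.5
```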
In addition to the approaches provided by @Andrie, the split()-then-lapply() approach is faster still:
> system.time(ZZ <- tapply(train_data$ACTION, train_data$USER_B, mean))
user system elapsed
1.025 0.011 1.062
> system.time(WW <- unlist(lapply(split(train_data$ACTION,
+ f = train_data$USER_B),
+ mean)))
user system elapsed
0.465 0.007 0.483
sapply() is just as quick for this problem:
> system.time(SS <- sapply(split(train_data$ACTION, f = train_data$USER_B),
+ mean))
user system elapsed
0.469 0.001 0.474
@gavin has already demonstrated the high performance of combining split and lapply.
The data.table package offers a further noticeable performance increase of ~75%:
library(data.table)
system.time({
VV <- as.data.table(train_data)[, list(ACTION=mean(ACTION)), by=USER_B]
})
user system elapsed
0.15 0.02 0.17
system.time(WW <- unlist(lapply(split(train_data$ACTION, f = train_data$USER_B),mean)))
user system elapsed
0.61 0.02 0.63
all(WW==VV$ACTION)
[1] TRUE
The data.table package is available on CRAN and has a website on R-Forge.
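As a further sketch, not from the original answer: if you will group by USER_B repeatedly, setkey() sorts the data.table by USER_B up front, which can speed up subsequent grouped and keyed operations:

```r
library(data.table)

# Small data frame re-creating the question's example
train_data <- data.frame(USER_B = c(7, 8, 6, 7, 8),
                         ACTION = c(0, 1, 2, 1, 2))

DT <- as.data.table(train_data)
setkey(DT, USER_B)  # sorts DT by USER_B; repeated grouped queries benefit
VV <- DT[, list(ACTION = mean(ACTION)), by = USER_B]
```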
You can try tapply:
train_data <- read.table("train.txt",header=T);
result <- tapply(train_data$ACTION,train_data$USER_B,function(x) sum(x)/length(x));
You can use mean instead of the anonymous function, but I have recently read that the sum(x)/length(x) version is faster (as long as you don't have any NAs, etc.).
I have not tested it, but I believe this should be faster. If you want an even faster solution, have a look at the Rcpp and inline packages.
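One more base-R sketch, not from the answers above, that mirrors the score definition literally (sum of ACTION divided by the number of rows per USER_B), using rowsum() and table():

```r
# Small data frame re-creating the question's example
train_data <- data.frame(USER_B = c(7, 8, 6, 7, 8),
                         ACTION = c(0, 1, 2, 1, 2))

# rowsum() sums ACTION within each USER_B (groups sorted by default);
# table() counts how many rows each USER_B has, in the same sorted order
sums   <- rowsum(train_data$ACTION, group = train_data$USER_B)
counts <- table(train_data$USER_B)
score  <- sums[, 1] / as.numeric(counts)
score
# 6 -> 2.0, 7 -> 0.5, 8 -> 1.5
```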