How to improve this Algorithm?
R Version 2.11.1 32-bit on Windows 7
I have the data file train.txt, shown below:
USER_A USER_B ACTION
1 7 0
1 8 1
2 6 2
2 7 1
3 8 2
And I process the data with the algorithm below:
train_data=read.table("train.txt",header=T)
result=matrix(0,length(unique(train_data$USER_B)),2)
result[,1]=unique(train_data$USER_B)
for(i in 1:dim(result)[1])
{
temp=train_data[train_data$USER_B%in%result[i,1],]
result[i,2]=sum(temp[,3])/dim(temp)[1]
}
The result is the score of every USER_B in train_data, where the score is defined as:
score of USER_B = (the sum of all ACTION values for USER_B) / (the number of times USER_B was recommended)
But train_data is very large; this program may take three days to finish, so I have come here to ask for help: can this algorithm be improved?
Running your example, your desired result is to calculate the mean ACTION for each unique USER_B:
[,1] [,2]
[1,] 7 0.5
[2,] 8 1.5
[3,] 6 2.0
You can do this with one line of code using the ddply() function from the plyr package:
library(plyr)
ddply(train_data[, -1], .(USER_B), numcolwise(mean))
USER_B ACTION
1 6 2.0
2 7 0.5
3 8 1.5
Alternatively, the function tapply in base R does the same:
tapply(train_data$ACTION, train_data$USER_B, mean)
Depending on the size of your table, you can get an improvement in execution time of 20x or higher. Here is the system.time test for a data.frame with a million entries. Your algorithm takes 116 seconds, ddply() takes 5.4 seconds, and tapply takes 1.2 seconds:
train_data <- data.frame(
USER_A = 1:1e6,
USER_B = sample(1:1e3, size=1e6, replace=TRUE),
ACTION = sample(1:100, size=1e6, replace=TRUE))
yourfunction <- function(){
result <- matrix(0,length(unique(train_data$USER_B)),2)
result[,1] <- unique(train_data$USER_B);
for(i in 1:dim(result)[1]){
temp=train_data[train_data$USER_B%in%result[i,1],]
result[i,2]=sum(temp[,3])/dim(temp)[1]
}
result
}
system.time(XX <- yourfunction())
user system elapsed
116.29 14.04 134.33
system.time(YY <- ddply(train_data[, -1], .(USER_B), numcolwise(mean)))
user system elapsed
5.43 1.60 7.19
system.time(ZZ <- tapply(train_data$ACTION, train_data$USER_B, mean))
user system elapsed
1.17 0.06 1.25
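For completeness (not timed in the answer above), base R's aggregate() computes the same per-group means and returns a data frame directly. It is generally slower than tapply on large tables, so treat this as a readability option; the small train_data below merely re-creates the question's example:

```r
# Small data frame re-creating the question's example
train_data <- data.frame(USER_A = c(1, 1, 2, 2, 3),
                         USER_B = c(7, 8, 6, 7, 8),
                         ACTION = c(0, 1, 2, 1, 2))

# aggregate() returns a data frame with one row per unique USER_B,
# sorted by USER_B
res <- aggregate(ACTION ~ USER_B, data = train_data, FUN = mean)
res
# USER_B 6 -> 2.0, 7 -> 0.5, 8 -> 1.5
```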
In addition to the approaches provided by @Andrie, the split()-then-lapply() approach is faster still:
> system.time(ZZ <- tapply(train_data$ACTION, train_data$USER_B, mean))
user system elapsed
1.025 0.011 1.062
> system.time(WW <- unlist(lapply(split(train_data$ACTION,
+ f = train_data$USER_B),
+ mean)))
user system elapsed
0.465 0.007 0.483
sapply() is just as quick for this problem:
> system.time(SS <- sapply(split(train_data$ACTION, f = train_data$USER_B),
+ mean))
user system elapsed
0.469 0.001 0.474
@gavin has already demonstrated the high performance of combining split and lapply.
The data.table package offers a further noticeable performance increase of ~75%:
library(data.table)
system.time({
VV <- as.data.table(train_data)[, list(ACTION=mean(ACTION)), by=USER_B]
})
user system elapsed
0.15 0.02 0.17
system.time(WW <- unlist(lapply(split(train_data$ACTION, f = train_data$USER_B),mean)))
user system elapsed
0.61 0.02 0.63
all(WW==VV$ACTION)
[1] TRUE
The data.table package is available on CRAN and has a website on R-Forge.
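As a further sketch, not from the original answer: if you will group by USER_B repeatedly, setkey() sorts the data.table by USER_B up front, which can speed up subsequent grouped and keyed operations:

```r
library(data.table)

# Small data frame re-creating the question's example
train_data <- data.frame(USER_B = c(7, 8, 6, 7, 8),
                         ACTION = c(0, 1, 2, 1, 2))

DT <- as.data.table(train_data)
setkey(DT, USER_B)  # sorts DT by USER_B; repeated grouped queries benefit
VV <- DT[, list(ACTION = mean(ACTION)), by = USER_B]
```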
You can try tapply:
train_data <- read.table("train.txt",header=T);
result <- tapply(train_data$ACTION,train_data$USER_B,function(x) sum(x)/length(x));
You can use mean instead of the anonymous function, but I have recently read that the sum(x)/length(x) version is faster (as long as you don't have any NAs, etc.).
I have not tested it, but I believe this should be faster. If you want an even faster solution, have a look at the Rcpp and inline packages.
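One more base-R sketch, not from the answers above, that mirrors the score definition literally (sum of ACTION divided by the number of rows per USER_B), using rowsum() and table():

```r
# Small data frame re-creating the question's example
train_data <- data.frame(USER_B = c(7, 8, 6, 7, 8),
                         ACTION = c(0, 1, 2, 1, 2))

# rowsum() sums ACTION within each USER_B (groups sorted by default);
# table() counts how many rows each USER_B has, in the same sorted order
sums   <- rowsum(train_data$ACTION, group = train_data$USER_B)
counts <- table(train_data$USER_B)
score  <- sums[, 1] / as.numeric(counts)
score
# 6 -> 2.0, 7 -> 0.5, 8 -> 1.5
```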