How to efficiently sum over levels defined in another variable?

I am new to R. I have the following function:

funItemAverRating = function()
{
    itemRatingNum = array(0, itemNum);
    print("begin");
    apply(input, 1, function(x)
        {
            itemId = x[2]+1;
            itemAverRating[itemId] <<- itemAverRating[itemId] + x[3];
            itemRatingNum[itemId] <<- itemRatingNum[itemId] + 1;
        }
    );
}

In this function, input is an n*3 data frame with n ~ 6*(10e+7), and itemRatingNum is a vector of size ~3*(10e+5).

Why is the apply call so slow (it takes nearly an hour to finish)? Also, as the function runs, it uses more and more memory, even though, as you can see, the variables are all defined outside the function passed to apply. Can anybody help me?

cheng


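It's slow because you call high-level R functions many times: the function passed to apply is invoked once for every row of input.

You have to vectorize your function, meaning that most operations (like <- or +1) should be applied to whole vectors at once rather than to one element at a time.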

For example it looks to me that itemRatingNum holds frequencies of input[[2]] (second column of input data.frame) which could be replaced by:

tb <- table(input[[2]]+1)
itemRatingNum[as.integer(names(tb))] <- tb
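
To make the difference concrete, here is a minimal sketch on made-up toy data (the column names and values are hypothetical, not from the question). It counts ratings per item once with a row-by-row loop and once with a single vectorized call. tabulate() is used here as an alternative to table() because it returns a plain integer vector and keeps a zero for ids that never occur:

```r
# Toy data (hypothetical): column 2 holds 0-based item ids, column 3 ratings.
itemNum <- 5
input <- data.frame(
    user   = c(1, 1, 2, 2, 3, 3),
    item   = c(0, 2, 2, 4, 0, 2),
    rating = c(5, 3, 4, 1, 2, 5)
)

# Loop version: one R-level indexing and assignment step per row.
loopCount <- numeric(itemNum)
for (i in seq_len(nrow(input))) {
    id <- input[i, 2] + 1
    loopCount[id] <- loopCount[id] + 1
}

# Vectorized version: a single call that counts all ids at once.
# Ids that never occur (items 1 and 3 here) get a count of 0.
vecCount <- tabulate(input[[2]] + 1, nbins = itemNum)

all(loopCount == vecCount)   # TRUE
vecCount                     # 2 0 3 0 1
```

On a few rows the two versions are indistinguishable; the gap only shows up at the row counts mentioned in the question.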


Don't do that. You're following a logic that is not R-like at all. If I understand it correctly, you want to add the values from the third column of the input data frame to the matching elements of an itemAverRating vector.

What itemRatingNum is for is rather unclear. It does not end up in the global environment; it just becomes a vector filled with frequencies at the end of the loop. Because you define itemRatingNum within the function, the <<- assignment finds it in the local environment of the function and assigns to it there, and it gets destroyed when the function ends.
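
The scoping behaviour of <<- described above can be checked with a small sketch. The names counter and f are made up for illustration; the structure mirrors the question's code, with a local variable modified from an anonymous function:

```r
counter <- 100   # a global variable with the same name as the local one below

f <- function() {
    counter <- 0                      # local to f()
    sapply(1:3, function(i) {
        counter <<- counter + 1       # <<- finds f()'s local counter first
    })
    counter                           # value of the local copy when f() ends
}

localResult <- f()
localResult   # 3
counter       # still 100: the global was never touched
```

<<- starts its search in the enclosing environment of the anonymous function, which is the body of f(), so it only reaches the global environment when no variable of that name is found on the way.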

Next, you should give your function input and get some output back. Never assign to the global environment if it's not necessary. Your function is equivalent to the following function, which is a whole lot faster and which takes input and gives output:

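funItemAverRating = function(x,input){
    # sum the ratings (column 3) per item id (column 2)
    sums <- rowsum(input[,3],input[,2])
    # item ids are 0-based, R indices are 1-based
    sumid <- as.numeric(rownames(sums))+1
    # update in place so that items without any rating keep their old value
    x[sumid] <- x[sumid]+c(sums)
    x
}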

(Function edited per Marek's comment.)

It works like this:

# make data
itemNum <- 10
set.seed(12)
input <- data.frame(
    a1 = rep(1:10,itemNum),
    a2 = sample(9:0,itemNum*10,TRUE),
    a3 = rep(10:1,itemNum)
)
itemAverRating <- array(0, itemNum)
itemAverRating <- funItemAverRating(itemAverRating,input)
itemAverRating
 0  1  2  3  4  5  6  7  8  9 
39 65 57 36 62 33 98 62 60 38 

If I try your code, I get :

> funItemAverRating()
[1] "begin"
...
> itemAverRating
 [1] 39 65 57 36 62 33 98 62 60 38

Which is the same. If you want itemRatingNum, then just do:

> itemRatingNum <- table(input[,2])
> itemRatingNum
 0  1  2  3  4  5  6  7  8  9 
 6 11 11  8 10  6 18  9 13  8 
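
If the end goal is the mean rating per item (the name itemAverRating suggests this, but the question does not say so, so treat it as an assumption), the sums and the counts can be combined without any loop. This is only a sketch on made-up toy data, with hypothetical column names:

```r
# Hypothetical toy data: column 2 holds 0-based item ids, column 3 ratings.
itemNum <- 4
input <- data.frame(
    user   = c(1, 2, 3, 4, 5),
    item   = c(0, 0, 2, 3, 3),
    rating = c(4, 2, 5, 1, 3)
)

idx  <- input[[2]] + 1                         # 1-based index per row
sums <- numeric(itemNum)
rs   <- rowsum(input[[3]], idx)                # one row per id that occurs
sums[as.integer(rownames(rs))] <- rs[, 1]
counts <- tabulate(idx, nbins = itemNum)       # 0 for ids that never occur

# Mean rating per item; items without ratings become NA instead of 0/0 = NaN.
itemMean <- ifelse(counts > 0, sums / counts, NA_real_)
itemMean   # 3 NA 5 2
```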