How to efficiently sum over levels defined in another variable?
I am new to R. I have the following function:
funItemAverRating = function()
{
    itemRatingNum = array(0, itemNum);
    print("begin");
    apply(input, 1, function(x)
        {
            itemId = x[2]+1;
            itemAverRating[itemId] <<- itemAverRating[itemId] + x[3];
            itemRatingNum[itemId] <<- itemRatingNum[itemId] + 1;
        }
    );
}
In this function, input is an n*3 data frame, where n is ~6*(10e+7), and itemRatingNum is a vector of size ~3*(10e+5).
Why is the apply function so slow (it would take nearly an hour to finish)? Also, as the function runs, it uses more and more memory, even though, as you can see, the variables are all defined outside the apply function. Can anybody help me?
cheng
It's slow because you call high-level R functions many times.
You have to vectorize your function, meaning that most operations (like <- or +1) should be computed over whole data vectors.
For example, it looks to me as if itemRatingNum holds the frequencies of input[[2]] (the second column of the input data.frame), so the loop could be replaced by:
tb <- table(input[[2]]+1)
itemRatingNum[as.integer(names(tb))] <- tb
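As a minimal sketch of the same vectorized counting, tabulate() can fill the whole count vector in one pass, including zeros for items that never occur. The toy data, the column names, and the value of itemNum below are assumptions made up for illustration; only the 0-based item ids in the second column follow the question.

```r
# Toy data (hypothetical): user id, 0-based item id, rating
input <- data.frame(
  user   = c(1, 1, 2, 3, 3),
  item   = c(0, 2, 2, 3, 0),
  rating = c(5, 3, 4, 1, 2)
)
itemNum <- 5  # assumed total number of items; items 1 and 4 have no ratings

# tabulate() counts how often each of 1..nbins occurs, in one vectorized call;
# the +1 shifts the 0-based item ids to 1-based positions
itemRatingNum <- tabulate(input[[2]] + 1, nbins = itemNum)
print(itemRatingNum)
# [1] 2 0 2 1 0
```

Unlike table(), this needs no lookup through names(), but it does assume the ids are non-negative integers smaller than itemNum.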
Don't do that. You're following a logic that is not R-like at all. If I understand it right, you want to add to an itemAverRating vector the values from the third column of some input data frame.
What itemRatingNum is doing is rather obscure. It does not end up in the global environment; it just becomes a vector filled with frequencies at the end of the loop. Because you define itemRatingNum within the function, the <<- assignment modifies it in the local environment of the function, and it gets destroyed when the function ends.
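The scoping behaviour of <<- described here can be checked with a small sketch. The names counterDemo and localCount are hypothetical and exist only for this illustration.

```r
counterDemo <- function() {
  localCount <- 0                   # defined in counterDemo's own environment
  sapply(1:3, function(i) {
    # <<- searches the enclosing environments, finds localCount in
    # counterDemo's environment, and modifies it there (not globally)
    localCount <<- localCount + 1
  })
  localCount                        # return the value before it is discarded
}

result <- counterDemo()
print(result)
# [1] 3
print(exists("localCount"))         # FALSE, assuming a fresh session
```

The counter only survives because the function returns it; nothing is left behind in the global environment.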
Next, you should give your function input and get some output. Never assign to the global environment if it's not necessary. Your function is equivalent to the following, much faster, function, which takes input and returns output:
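funItemAverRating = function(x,input){
    # sum the third column within each level of the second column
    sums <- rowsum(input[,3],input[,2])
    # item ids are 0-based, so shift them to 1-based positions in x
    sumid <- as.numeric(rownames(sums))+1
    # add the per-item sums to the matching entries of x
    x[sumid]+c(sums)
}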
(Function edited per Marek's comment.)
It works like this:
# make data
itemNum <- 10
set.seed(12)
input <- data.frame(
a1 = rep(1:10,itemNum),
a2 = sample(9:0,itemNum*10,TRUE),
a3 = rep(10:1,itemNum)
)
itemAverRating <- array(0, itemNum)
itemAverRating <- funItemAverRating(itemAverRating,input)
itemAverRating
0 1 2 3 4 5 6 7 8 9
39 65 57 36 62 33 98 62 60 38
If I run your code, I get:
> funItemAverRating()
[1] "begin"
...
> itemAverRating
[1] 39 65 57 36 62 33 98 62 60 38
That is the same result. If you want itemRatingNum, then just do:
> itemRatingNum <- table(input[,2])
> itemRatingNum
0 1 2 3 4 5 6 7 8 9
6 11 11 8 10 6 18 9 13 8
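If the end goal is the average rating per item (an assumption based on the function name, not something the question states), a single tapply() call gives it directly. The toy data, column names, and itemNum below are hypothetical.

```r
# Toy data (hypothetical): user id, 0-based item id, rating
input <- data.frame(
  user   = c(1, 1, 2, 3, 3),
  item   = c(0, 2, 2, 3, 0),
  rating = c(5, 3, 4, 1, 2)
)
itemNum <- 5  # assumed total number of items

# Listing every possible id as a factor level keeps unrated items in the
# result; tapply() returns NA for those empty groups
itemFactor <- factor(input[[2]], levels = 0:(itemNum - 1))
itemAverRating <- tapply(input[[3]], itemFactor, mean)
print(itemAverRating)
#   0   1   2   3   4 
# 3.5  NA 3.5 1.0  NA 
```

For data as large as in the question, dividing the rowsum() result by the counts may well be faster than tapply(); this sketch favours brevity over speed.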