开发者

Avoiding Loop with R using Apply (?)

I'm trying to run apply a function to each row of a dataset. The function looks up matching rows in a second dataset and computes a similarity score for the product details passed to it.

The function works if I just call it with test numbers but I can't figure out how to run it on all rows of my dataset. I've tried using apply but can't g开发者_Go百科et it working.

I'm going to be iterating different parameter settings to find those that best fit historical data so speed is important... meaning that a loop is out. Any help you can provide would be hugely appreciated.

Thanks! Alan

GetDistanceTest <- function(SnapshotDate, Cand_Type, Cand_Height, Cand_Age) {
    HeightParam <- 1/5000
        AgeParam <- 1
    Stock_SameType <- HistoricalStock[!is.na(HistoricalStock$date) & !is.na(HistoricalStock$Type) & as.character(HistoricalStock$date)==as.character(SnapshotDate) & HistoricalStock$Type==Cand_Type,]

    Stock_SameType$ED <- (HeightParam*(Stock_SameType$Height - Cand_Height))^2 + (AgeParam*(Stock_SameType$Age - Cand_Age))^2

    return(sqrt(sum(Stock_SameType$ED)))

}

HistoricalStock <- HistoricalAQStock[,c(1, 3, 4, 5)]
colnames(HistoricalStock) <- c("date", "Age", "Height", "Type")
Sales <- AllSales[,c(2,10,11,25)]
colnames(Sales) <- c("date", "Age", "Height", "Type")

GetDistanceTest("2010-04-01", 5261, 12, 7523) #works and returns a single number

res1 <- transform(Sales, ClusterScore=GetDistanceTest(date, Type, Height, Age))
        # returns Error in `$<-.data.frame`(`*tmp*`, "ED", value = c(419776714.528591, 22321257.0276852,  :  replacement has 4060 rows, data has 54
    # also 4 warnings, one for each variable. e.g. 1: In as.character(HistoricalStock$date) == as.character(SnapshotDate) :  longer object length is not a multiple of shorter object length

res2 <- apply(Sales, 1, GetDistanceTest, Sales$Type, Sales$Height, Sales$Age)
    # `$<-.data.frame`(`*tmp*`, "ED", value = c(419648071.041523, 22325941.2704261,  : replacement has 4060 rows, data has 13
    # also same 4 warnings as res1


I took some liberties with your code b/c I try to vectorize vice use loops whenever I can... With the merge function, you merge the two data frames, and operate on the "columns", which allows you to use the vectorization built into R. I think this will do what you want (in the second line I'm just making sure that A and B don't have the same values for height and age so that your distance isn't always zero):

A <- B <- data.frame(date=Sys.Date()-9:0, stock=letters[1:10], type=1:10, height=1:10, age=1:10)
B$height <- B$age <- 10:1
AB <- merge(x=A, y=B, by=c("date", "type"), suffixes=c(".A", ".B"))
height.param <- 1/5000
age.param <- 1
temp <- sqrt( height.param * (AB$height.A - AB$height.B)^2 + age.param * (AB$age.A - AB$age.B)^2 )


Use mapply, the multivariate form of apply:

res1 <- mapply(GetDistanceTest, Sales$date, Sales$Type, Sales$Height, Sales$Age)


Code as per above comment:

A <- data.frame(date=rep(Sys.Date()-9:0,100), id=letters[1:10], type=floor(runif(1000, 1, 10)), height=runif(1000, 1, 100), age=runif(1000, 1, 100))
B <- data.frame(date=rep(Sys.Date()-9:0,1000), type=floor(runif(10000, 1, 10)), height=runif(10000, 1, 10), age=runif(10000, 1, 10))



AB <- merge(x=A, y=B, by=c("date", "type"), suffixes=c(".A", ".B"))
height.param <- 1
age.param <- 1
AB$ClusterScore <- sqrt( height.param * (AB$height.A - AB$height.B)^2 + age.param * (AB$age.A - AB$age.B)^2 )
Scores <- ddply(AB, c("id"), function(df)sum(df$ClusterScore))
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜