开发者

speeding up data frame matching

I have two dataframes, much like these:

data = data.frame(data=cbind(1:12,rep(c(1,2),6),rep(c(1,2,3),4)))
colnames(data)=c('v','h','c')

lookup = data.frame(data=cbind(c(rep(1,3),rep(2,3)),rep(c(1,2,3),2),21:26))
colnames(lookup)=c('h','c','t')

I want to subtract lookup$t from data$v where the h and c columns match.

I thought something like this would work

data$v-lookup$t[lookup$h==data$h&lookup$c==data$c]

but doesn't magically know that I want to implicitly iterate over the rows of data

I ended up doing this

myt = c()
for(i in 1:12) {
myt[i] = lookup$t[lookup$h==data$h[i]&lookup$c==data$c[i]]
}

which works fine, but I'm hoping someone ca开发者_运维百科n suggest a more sensible way that doesn't involve a loop.


Sounds like you could merge and then do the math:

dataLookedUp <- merge(data, lookup)
dataLookedUp$newValue <- with(dataLookedUp, v - t )

For your real data, is the merge and calc faster?

If data and/or lookup is really big you might use data.table to create an index before the merge in order to speed it up.


An alternative that is 1.) more familiar to those accustomed to SQL queries and 2.) often faster than the standard merge is to use the sqldf package. (Note that on Mac OS X, you'll probably want to install Tcl/Tk, on which sqldf depends.) As an added bonus, sqldf converts strings to factors automagically by default.

install.packages("sqldf")
library(sqldf)
data <- data.frame(v = 1:12, h = rep(c("one", "two"), 6), c = rep(c("one", "two", "three"), 4))
lookup <- data.frame(h = c(rep("one", 3), rep("two", 3)), c = rep(c("one", "two", "three"), 2), t =  21:26)
soln <- sqldf("select * from data inner join lookup using (h, c)")
soln <- transform(soln, v.minus.t = v - t)


With your integer columns, I don't think there's anything you can do to improve on JD's offering, but if you had strings in the columns on which you merge, you could create factors with as.factor, which could speed up the merge depending on the size of your data set and how many merges/sorts you anticipate:

data <- data.frame(v = 1:12, h = rep(c("one", "two"), 6), c = rep(c("one", "two", "three"), 4))
lookup <- data.frame(h = c(rep("one", 3), rep("two", 3)), c = rep(c("one", "two", "three"), 2), t =  21:26)
data <- transform(data, h = as.factor(h), c = as.factor(c))
lookup <- transform(lookup, h = as.factor(h), c = as.factor(c))
temp <- merge(data, lookup)
temp <- transform(temp, v.minus.t = v - t)


This is perfect for data.table using by without by

library(data.table)
data <- as.data.table(data)
lookup <- as.data.table(lookup)
setkey(data, h, c)
setkey(lookup, h,c)

data[lookup, list(v,t, newValue = v-t)]
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜