speeding up data frame matching
I have two dataframes, much like these:
data = data.frame(data=cbind(1:12,rep(c(1,2),6),rep(c(1,2,3),4)))
colnames(data)=c('v','h','c')
lookup = data.frame(data=cbind(c(rep(1,3),rep(2,3)),rep(c(1,2,3),2),21:26))
colnames(lookup)=c('h','c','t')
I want to subtract lookup$t from data$v where the h and c columns match.
I thought something like this would work
data$v-lookup$t[lookup$h==data$h&lookup$c==data$c]
but doesn't magically know that I want to implicitly iterate over the rows of data
I ended up doing this
myt = c()
for(i in 1:12) {
myt[i] = lookup$t[lookup$h==data$h[i]&lookup$c==data$c[i]]
}
which works fine, but I'm hoping someone ca开发者_运维百科n suggest a more sensible way that doesn't involve a loop.
Sounds like you could merge and then do the math:
dataLookedUp <- merge(data, lookup)
dataLookedUp$newValue <- with(dataLookedUp, v - t )
For your real data, is the merge and calc faster?
If data and/or lookup is really big you might use data.table to create an index before the merge in order to speed it up.
An alternative that is 1.) more familiar to those accustomed to SQL queries and 2.) often faster than the standard merge is to use the sqldf package. (Note that on Mac OS X, you'll probably want to install Tcl/Tk, on which sqldf depends.) As an added bonus, sqldf converts strings to factors automagically by default.
install.packages("sqldf")
library(sqldf)
data <- data.frame(v = 1:12, h = rep(c("one", "two"), 6), c = rep(c("one", "two", "three"), 4))
lookup <- data.frame(h = c(rep("one", 3), rep("two", 3)), c = rep(c("one", "two", "three"), 2), t = 21:26)
soln <- sqldf("select * from data inner join lookup using (h, c)")
soln <- transform(soln, v.minus.t = v - t)
With your integer columns, I don't think there's anything you can do to improve on JD's offering, but if you had strings in the columns on which you merge, you could create factors with as.factor
, which could speed up the merge
depending on the size of your data set and how many merges/sorts you anticipate:
data <- data.frame(v = 1:12, h = rep(c("one", "two"), 6), c = rep(c("one", "two", "three"), 4))
lookup <- data.frame(h = c(rep("one", 3), rep("two", 3)), c = rep(c("one", "two", "three"), 2), t = 21:26)
data <- transform(data, h = as.factor(h), c = as.factor(c))
lookup <- transform(lookup, h = as.factor(h), c = as.factor(c))
temp <- merge(data, lookup)
temp <- transform(temp, v.minus.t = v - t)
This is perfect for data.table
using by without by
library(data.table)
data <- as.data.table(data)
lookup <- as.data.table(lookup)
setkey(data, h, c)
setkey(lookup, h,c)
data[lookup, list(v,t, newValue = v-t)]
精彩评论