Dividing specific values between two arrays
I am a novice R user working with a data set of 40,000 rows and 300 columns. I have found a solution for what I would like to do, but my machine takes over an hour to run my code, and I feel an expert could help me find a quicker one (I can do this in Excel in half the time). I will post my solution at the end.
What I would like to do is the following:
Compute the average value of each column NY1 to NYn, grouped by the value of the YYYYMMbucket column.
Divide each original value by the average for its YYYYMMbucket.
Here is sample of my original data set:
   YYYYMMbucket   NY1 NY2    NY3   NY4
1      200701.3 0.309  NA 20.719 16260
2      200701.3 0.265  NA 19.482 15138
3      200701.3 0.239  NA 19.168 14418
4      200701.3 0.225  NA 19.106 14046
5      200701.3 0.223  NA 19.211 14040
6      200701.3 0.234  NA 19.621 14718
7      200701.3 0.270  NA 20.522 15780
8      200701.3 0.298  NA 22.284 16662
9      200701.2 0.330  NA 23.420 16914
10     200701.2 0.354  NA 23.805 17310
11     200701.2 0.388  NA 24.095 17448
12     200701.2 0.367  NA 23.954 17640
13     200701.2 0.355  NA 23.255 17748
14     200701.2 0.346  NA 22.731 17544
15     200701.2 0.347  NA 22.445 17472
16     200701.2 0.366  NA 21.945 17634
17     200701.2 0.408  NA 22.683 18876
18     200701.2 0.478  NA 23.189 21498
19     200701.2 0.550  NA 23.785 22284
20     200701.2 0.601  NA 24.515 22368
This is what my averages look like:
   YYYYMMbucket      NY1M NY2M
1      200701.1 0.4424574   NA
2      200701.2 0.4530000   NA
3      200701.3 0.2936935   NA
4      200702.1 0.4624063   NA
5      200702.2 0.4785937   NA
6      200702.3 0.3091161   NA
7      200703.1 0.4159687   NA
8      200703.2 0.4491875   NA
9      200703.3 0.2840081   NA
10     200704.1 0.4279137   NA
How I would like my final output to look:
    NY1avgs NY2avgs   NY3avgs
1  1.052117      NA 0.7560868
2 0.9023011      NA 0.7109456
3 0.8137734      NA  0.699487
4 0.7661047      NA 0.6972245
5 0.7592949      NA 0.7010562
6 0.7967489      NA 0.7160181
7 0.9193256      NA 0.7488978
8  1.014663      NA 0.8131974
9 0.7284768      NA  0.857904
Here's how I did it:
First I used "plyr" to compute my averages, simple enough:
library(plyr)
test <- ddply(prf.delete2b, .(YYYYMMbucket), summarise,
              NY1M = mean(NY1), NY2M = mean(NY2), ...)  # ... and so on for the remaining NY columns
Then used a series of the following:
x <- c(1:40893)
lookv <- function(x,ltab,rcol=2) ltab[max(which(ltab[,1]<=x)),rcol]
NY1Fun <- function(x) (prf.delete2b$NY1[x] / lookv((prf.delete2b$YYYYMMbucket[x]),test,2))
NY2Fun <- function(x) (prf.delete2b$NY2[x] / lookv((prf.delete2b$YYYYMMbucket[x]),test,3))
NY1Avgs <- lapply(x, NY1Fun)
NY2Avgs <- lapply(x, NY2Fun)
I also tried a variant of the above by saying:
NY1Fun <- function(x) (prf.delete2b$NY1[x] / subset(test, YYYYMMbucket == prf.delete2b$YYYYMMbucket[x], select =c(NY1M)))
lapply(x, NY1Fun)
Each variant of NYnFun takes a good 20 seconds to run so doing this 300 times takes much too long. Can anyone recommend any alternative to what I posted or point out any novice mistakes I've made?
Here is the customary data.table approach, which works pretty fast.
# CREATE DUMMY DATA
N = 1000
mydf = data.frame(
  bucket = sample(letters, N, replace = TRUE),
  NY1 = runif(N),
  NY2 = runif(N),
  NY3 = runif(N),
  NY4 = runif(N)
)
# SCALE COLUMNS BY AVG
library(data.table)
scale_x = function(x) x/ave(x)
mydt = data.table(mydf)
ans = mydt[,lapply(.SD, scale_x), by = 'bucket']
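One thing to watch with real data: ave(x) takes a plain mean, so a single NA in a bucket turns that whole bucket into NA. Below is a minimal sketch of a variant that skips NAs when computing the group mean. The toy table dt and the names scale_by_mean and valueCols are made up for illustration; they are not from the question.

```r
library(data.table)

# Toy data (illustrative names); NY2 has one missing value in bucket "a"
dt <- data.table(
  bucket = c("a", "a", "b", "b"),
  NY1    = c(1, 3, 10, 30),
  NY2    = c(NA, 4, 5, 15)
)

# Divide by the group mean, ignoring NAs when computing that mean
scale_by_mean <- function(x) x / mean(x, na.rm = TRUE)

# Scale every column except the grouping column, one bucket at a time
valueCols <- setdiff(names(dt), "bucket")
scaled <- dt[, lapply(.SD, scale_by_mean), by = bucket, .SDcols = valueCols]
```

Rows that are NA in the input stay NA in the output; the other rows in the same bucket are still scaled. A column that is entirely NA (like NY2 in the question) simply comes back as NA.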
How about:
test2 <- merge(prf.delete2b, test, all.x=TRUE)
test2[2:ncol(prf.delete2b)] / test2[(ncol(prf.delete2b)+1):ncol(test2)]
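To make the merge idea concrete, here is a small sketch on toy data. The data frame df stands in for prf.delete2b, and base aggregate() stands in for the ddply step; both are assumptions for illustration only. Note that merge() sorts by the key by default, so the rows may come back in a different order than they went in.

```r
# Toy data standing in for prf.delete2b (names are illustrative only)
df <- data.frame(
  bucket = c("a", "a", "b", "b"),
  NY1    = c(1, 3, 10, 30),
  NY2    = c(2, 6, 5, 15)
)

# Per-bucket means; aggregate() is used here in place of ddply()
means <- aggregate(df[-1], by = df["bucket"], FUN = mean, na.rm = TRUE)
names(means)[-1] <- paste0(names(means)[-1], "M")

# Attach each row's bucket means, then divide values by means column-wise
merged    <- merge(df, means, by = "bucket", all.x = TRUE)
valueCols <- names(df)[-1]
meanCols  <- paste0(valueCols, "M")
ratios    <- merged[valueCols] / merged[meanCols]
```

Selecting the value and mean columns by name rather than by position avoids surprises if the column order changes after the merge.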
In this case, I would use ave instead of ddply because ave returns a vector the same length as its input. ave only accepts a vector, so you need to use lapply to loop over the columns of your data.frame.
myFun <- function(x, groupVar) {
  x / ave(x, groupVar, FUN=function(y) mean(y, na.rm=TRUE))
}
relToMeans <- data.frame(prf.delete2b[1],
                         lapply(prf.delete2b[-1], myFun, groupVar=prf.delete2b[1]))
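As a quick sanity check, the same idea can be run on a few made-up rows. The data frame df and the function name relToGroupMean are hypothetical, chosen only for this sketch; the grouping column is passed as a plain vector here.

```r
# Toy data standing in for prf.delete2b (values are made up)
df <- data.frame(
  bucket = c(200701.3, 200701.3, 200701.2, 200701.2),
  NY1    = c(0.2, 0.4, 0.3, 0.9),
  NY2    = rep(NA_real_, 4)   # an all-NA column, like NY2 in the question
)

# Divide a vector by its group mean, ignoring NAs within each group
relToGroupMean <- function(x, groupVar) {
  x / ave(x, groupVar, FUN = function(y) mean(y, na.rm = TRUE))
}

# Apply to every column except the grouping column; row order is preserved
scaled <- data.frame(df["bucket"],
                     lapply(df[-1], relToGroupMean, groupVar = df$bucket))
```

Because ave returns values in the original row order, scaled lines up row for row with df, and the all-NA column stays NA.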