Dividing specific values between two arrays

I am a novice R user working with a data set of 40,000 rows and 300 columns. I have a solution for what I want to do, but it takes over an hour to run on my machine, and I suspect an expert could suggest something much faster (I can do this in Excel in half the time). My solution is posted at the end.

What I would like to do is the following:

  1. Compute the average value for each column NY1 to NYn based on the value of the YYYYMMbucket column.

  2. Divide each original value by the average for its YYYYMMbucket value.

Here is a sample of my original data set:

     YYYYMMbucket    NY1  NY2  NY3   NY4
1      200701.3     0.309  NA 20.719 16260
2      200701.3     0.265  NA 19.482 15138
3      200701.3     0.239  NA 19.168 14418
4      200701.3     0.225  NA 19.106 14046
5      200701.3     0.223  NA 19.211 14040
6      200701.3     0.234  NA 19.621 14718
7      200701.3     0.270  NA 20.522 15780
8      200701.3     0.298  NA 22.284 16662
9      200701.2     0.330  NA 23.420 16914
10     200701.2     0.354  NA 23.805 17310
11     200701.2     0.388  NA 24.095 17448
12     200701.2     0.367  NA 23.954 17640
13     200701.2     0.355  NA 23.255 17748
14     200701.2     0.346  NA 22.731 17544
15     200701.2     0.347  NA 22.445 17472
16     200701.2     0.366  NA 21.945 17634
17     200701.2     0.408  NA 22.683 18876
18     200701.2     0.478  NA 23.189 21498
19     200701.2     0.550  NA 23.785 22284
20     200701.2     0.601  NA 24.515 22368

This is what my averages look like:

     YYYYMMbucket  NY1M     NY2M
1      200701.1  0.4424574   NA
2      200701.2  0.4530000   NA
3      200701.3  0.2936935   NA
4      200702.1  0.4624063   NA
5      200702.2  0.4785937   NA
6      200702.3  0.3091161   NA
7      200703.1  0.4159687   NA
8      200703.2  0.4491875   NA
9      200703.3  0.2840081   NA
10     200704.1  0.4279137   NA

How I would like my final output to look:

  NY1avgs   NY2avgs    NY3avgs
1  1.052117     NA  0.7560868
2  0.9023011    NA  0.7109456
3  0.8137734    NA  0.699487
4  0.7661047    NA  0.6972245
5  0.7592949    NA  0.7010562
6  0.7967489    NA  0.7160181
7  0.9193256    NA  0.7488978
8  1.014663     NA  0.8131974
9  0.7284768    NA  0.857904

Here's how I did it:

First I used "plyr" to compute my averages, simple enough:

test <- ddply(prf.delete2b, .(YYYYMMbucket), summarise,
    NY1M = mean(NY1), NY2M = mean(NY2) ... ...))

Then used a series of the following:

x <- c(1:40893)

lookv <- function(x,ltab,rcol=2) ltab[max(which(ltab[,1]<=x)),rcol]

NY1Fun <- function(x) (prf.delete2b$NY1[x] / lookv((prf.delete2b$YYYYMMbucket[x]),test,2))

NY2Fun <- function(x) (prf.delete2b$NY2[x] / lookv((prf.delete2b$YYYYMMbucket[x]),test,3))

NY1Avgs <- lapply(x, NY1Fun)
NY2Avgs <- lapply(x, NY2Fun)

I also tried a variant of the above by saying:

NY1Fun <- function(x) (prf.delete2b$NY1[x] / subset(test, YYYYMMbucket == prf.delete2b$YYYYMMbucket[x], select =c(NY1M)))

lapply(x, NY1Fun)

Each variant of NYnFun takes a good 20 seconds to run so doing this 300 times takes much too long. Can anyone recommend any alternative to what I posted or point out any novice mistakes I've made?


Here is the customary data.table approach, which works pretty fast.

# CREATE DUMMY DATA
N = 1000
mydf = data.frame(
  bucket = sample(letters, N, replace = T),
  NY1    = runif(N),
  NY2    = runif(N),
  NY3    = runif(N),
  NY4    = runif(N)
)

# SCALE COLUMNS BY AVG
library(data.table)
scale_x = function(x) x/ave(x)
mydt = data.table(mydf)
ans  = mydt[,lapply(.SD, scale_x), by = 'bucket']
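
Since the sample data in the question contains NA values, here is a small sketch of the same idea with NA-tolerant averaging. The toy values and the names `toy` and `scale_na` are made up for illustration and are not from the question; it assumes the data.table package is installed.

```r
library(data.table)

# Toy data (hypothetical values): two buckets, one column containing an NA
toy <- data.table(
  bucket = c("a", "a", "b", "b"),
  NY1    = c(1, 3, 10, 30),
  NY2    = c(2, NA, 5, 15)
)

# Divide each value by the mean of its group, ignoring NAs when averaging
scale_na <- function(x) x / mean(x, na.rm = TRUE)

# .SD holds every non-grouping column for the current bucket
scaled <- toy[, lapply(.SD, scale_na), by = bucket]
print(scaled)
```

An NA input stays NA in the output, but it no longer turns the whole bucket's result into NA.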


How about:

test2 <- merge(prf.delete2b, test, all.x=TRUE)
test2[2:ncol(prf.delete2b)]/test2[(ncol(prf.delete2b)+1):ncol(test2)]
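
One thing to watch with the merge route is that merge() sorts the result by the key, so the rows may come back in a different order than the original data. Below is a rough, self-contained sketch on toy data. The values and the names `df`, `means`, `row_id` and `value_cols` are hypothetical, and base `aggregate()` stands in for the `ddply` call only so the example needs no extra packages.

```r
# Toy data (hypothetical values): buckets deliberately not in sorted order
df <- data.frame(
  YYYYMMbucket = c(200701.3, 200701.2, 200701.3, 200701.2),
  NY1 = c(1, 10, 3, 30),
  NY2 = c(2, 20, 6, 60)
)

# Per-bucket means, one row per bucket, with an "M" suffix on the value columns
means <- aggregate(df[-1], by = df["YYYYMMbucket"], FUN = mean, na.rm = TRUE)
names(means)[-1] <- paste0(names(means)[-1], "M")

# Remember the original row order, because merge() sorts by the key
df$row_id <- seq_len(nrow(df))
merged <- merge(df, means, by = "YYYYMMbucket", all.x = TRUE)
merged <- merged[order(merged$row_id), ]

# Element-wise division of the value columns by their matching mean columns
value_cols <- c("NY1", "NY2")
ratios <- merged[value_cols] / merged[paste0(value_cols, "M")]
rownames(ratios) <- NULL
print(ratios)
```

Selecting the columns by name rather than by position also keeps the division correct if the column layout changes.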


In this case, I would use ave instead of ddply because ave returns a vector the same length as its input. ave only accepts a vector, so you need to use lapply to loop over the columns of your data.frame.

myFun <- function(x, groupVar) {
  x / ave(x, groupVar, FUN=function(y) mean(y, na.rm=TRUE))
}
relToMeans <- data.frame(prf.delete2b[1],
  lapply(prf.delete2b[-1], myFun, groupVar=prf.delete2b[1]))
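
To see what this produces, here is a quick check on toy data. The values and the names `toy`, `relToGroupMean` and `out` are invented for the example; the all-NA NY2 column mimics the sample in the question.

```r
# Toy data (hypothetical values); NY2 is entirely NA, as in the question's sample
toy <- data.frame(
  YYYYMMbucket = c(200701.3, 200701.3, 200701.2, 200701.2),
  NY1 = c(0.2, 0.6, 1, 3),
  NY2 = NA_real_
)

# Divide x by the mean of its own group; ave() keeps the input's length and order
relToGroupMean <- function(x, g) {
  x / ave(x, g, FUN = function(y) mean(y, na.rm = TRUE))
}

out <- data.frame(
  toy["YYYYMMbucket"],
  lapply(toy[-1], relToGroupMean, g = toy$YYYYMMbucket)
)
print(out)
```

The result has one row per input row, in the original order, so no lookup or merge step is needed afterwards.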