How to deal with data matching specific criteria in a large dataset in R

I have a large data set in R (1.2M records). Those are some readings for different protocols. Now, I would like to classify this data (which I can do with rpart/RWeka). However, I first need to process the data, and this question is exactly about that.

The data set consists of a pair of outputs (throughput, response time) per set of control parameters, for 4 different protocols. Now, I would like to "bin" these values, and for each set of control parameters choose only those protocols which are within 10% of the maximum throughput (for that set of input params), and within 10% of the minimum response time.

I know I can use aggregate to find the max throughput and min response time in another data.frame, and then join it with the original data.frame. Then, I can use ifelse to find the protocol names matching the criteria. However, that seems inefficient to me, and I don't know how I would encode multiple matches (per set of input values) in a single column.
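For reference, the aggregate-and-join idea described above might look roughly like the sketch below. The data frame name `readings` and the helper columns `THRMAX`, `RTMIN`, `THROK` and `RTOK` are made up for illustration, and "within 10%" is assumed to mean at least 90% of the maximum throughput and at most 110% of the minimum response time.

```r
# Sample readings copied from the example in this question.
readings <- data.frame(
  PROTO = rep(c("A", "B", "C"), times = 3),
  REQS  = rep(c(8, 16, 8), each = 3),
  REPS  = rep(c(8, 8, 16), each = 3),
  THR   = c(10, 9.5, 7, 10, 5, 1, 8, 10, 9.5),
  RT    = c(1, 2, 1.1, 4, 1, 0.5, 1, 1.09, 1),
  stringsAsFactors = FALSE
)

# Best value per set of input parameters.
best_thr <- aggregate(THR ~ REQS + REPS, data = readings, FUN = max)
best_rt  <- aggregate(RT ~ REQS + REPS, data = readings, FUN = min)
names(best_thr)[3] <- "THRMAX"  # hypothetical column name
names(best_rt)[3]  <- "RTMIN"   # hypothetical column name

# Join the per-group bests back onto every reading (merges on REQS and REPS).
joined <- merge(merge(readings, best_thr), best_rt)

# Assumed thresholds: >= 90% of max throughput, <= 110% of min response time.
joined$THROK <- joined$THR >= 0.9 * joined$THRMAX
joined$RTOK  <- joined$RT <= 1.1 * joined$RTMIN
```

This leaves one flag per reading; collapsing the matching protocol names into a single column per parameter set is the part that is still missing.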

Any suggestions?

Example (REQS and REPS are input parameters):

PROTO  REQS  REPS  THR  RT
A      8     8     10   1
B      8     8     9.5  2
C      8     8     7    1.1
A      16    8     10   4
B      16    8     5    1
C      16    8     1    0.5
A      8     16    8    1
B      8     16    10   1.09
C      8     16    9.5  1

Should produce something like:

REQS REPS THRGOOD RTGOOD BOTHGOOD
8    8    A,B     A,C    A
16   8    A       C      empty
8    16   B,C     A,B,C  B,C


ddply from the plyr package should be your friend here.

First, write a function that will give you the desired result if you were to get a data.frame with only the rows for 1 set of input parameters:

forOneSet <- function(dfr)
{
  # Assumed reading of "within 10%": THR at least 90% of the group maximum,
  # RT at most 110% of the group minimum - adapt the limits if needed
  THRlim <- 0.9 * max(dfr$THR)
  RTlim <- 1.1 * min(dfr$RT)
  proto <- as.character(dfr$PROTO) # plain names, even if PROTO is a factor
  thrgood <- proto[dfr$THR >= THRlim]
  rtgood <- proto[dfr$RT <= RTlim]
  # protocols that meet both criteria
  bothgood <- intersect(thrgood, rtgood)
  # return a data.frame with the wanted results for this 'partial' data.frame
  data.frame(REQS = dfr$REQS[1], REPS = dfr$REPS[1],
             THRGOOD = paste(thrgood, collapse = ","),
             RTGOOD = paste(rtgood, collapse = ","),
             BOTHGOOD = paste(bothgood, collapse = ","))
}

Now you can immediately use ddply (I'm assuming your original data.frame is called orgdfr):

library(plyr)
result <- ddply(orgdfr, .(REQS, REPS), forOneSet)
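
If you want to try the idea on the sample data from the question without installing plyr, here is a minimal self-contained sketch of the same split-apply-combine step in base R. It is not the ddply call itself: `sample_data`, `summarise_group` and `summary_tbl` are names made up for this illustration, and the thresholds assume that "within 10%" means at least 90% of the maximum throughput and at most 110% of the minimum response time.

```r
# Sample data copied from the example in the question.
sample_data <- data.frame(
  PROTO = rep(c("A", "B", "C"), times = 3),
  REQS  = rep(c(8, 16, 8), each = 3),
  REPS  = rep(c(8, 8, 16), each = 3),
  THR   = c(10, 9.5, 7, 10, 5, 1, 8, 10, 9.5),
  RT    = c(1, 2, 1.1, 4, 1, 0.5, 1, 1.09, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summarise the rows of ONE (REQS, REPS) combination.
summarise_group <- function(grp) {
  thr_good <- grp$PROTO[grp$THR >= 0.9 * max(grp$THR)]  # assumed threshold
  rt_good  <- grp$PROTO[grp$RT <= 1.1 * min(grp$RT)]    # assumed threshold
  data.frame(
    REQS = grp$REQS[1],
    REPS = grp$REPS[1],
    THRGOOD  = paste(thr_good, collapse = ","),
    RTGOOD   = paste(rt_good, collapse = ","),
    BOTHGOOD = paste(intersect(thr_good, rt_good), collapse = ","),
    stringsAsFactors = FALSE
  )
}

# Split by parameter set, summarise each piece, and stack the results.
groups <- split(sample_data, list(sample_data$REQS, sample_data$REPS), drop = TRUE)
summary_tbl <- do.call(rbind, lapply(groups, summarise_group))
rownames(summary_tbl) <- NULL
print(summary_tbl)
```

On the sample data this should reproduce the table from the question, with an empty string where the question shows "empty". For the full 1.2M rows it is probably worth timing this against the ddply version before settling on either.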