How to deal with data matching specific criteria in a large dataset in R
I have a large data set in R (1.2M records). Those are some readings for different protocols. Now, I would like to classify this data (which I can do with rpart/RWeka). However, I first need to process the data, and this question is exactly about that.
The data set consists of a pair of outputs (throughput, response time) per set of control parameters, for 4 different protocols. I would like to "bin" these values: for each set of control parameters, choose only those protocols that are within 10% of the maximum throughput (for that set of input parameters), and those that are within 10% of the minimum response time.
I know I can use aggregate to find the maximum throughput and minimum response time in another data.frame, and then join it with the original data.frame. Then I can use ifelse to find the protocol names matching the criteria. However, that seems inefficient to me, and I don't know how I would encode multiple matches (per set of input values) in a single column.
Any suggestions?
Example (REQS and REPS are input parameters):
PROTO REQS REPS THR RT
A 8 8 10 1
B 8 8 9.5 2
C 8 8 7 1.1
A 16 8 10 4
B 16 8 5 1
C 16 8 1 0.5
A 8 16 8 1
B 8 16 10 1.09
C 8 16 9.5 1
Should produce something like:
REQS REPS THRGOOD RTGOOD BOTHGOOD
8 8 A,B A,C A
16 8 A C empty
8 16 B,C A,B,C B,C
ddply from the plyr package should be your friend here.
First, write a function that will give you the desired result if you were to get a data.frame with only the rows for 1 set of input parameters:
forOneSet <- function(dfr)
{
  # "Within 10%" is read here as: at least 90% of the group's maximum throughput
  # and at most 110% of the group's minimum response time - adapt if needed.
  THRlim <- 0.9 * max(dfr$THR)
  RTlim <- 1.1 * min(dfr$RT)
  proto <- as.character(dfr$PROTO)  # works whether PROTO is a factor or character
  thrgood <- proto[dfr$THR >= THRlim]
  rtgood <- proto[dfr$RT <= RTlim]
  bothgood <- intersect(thrgood, rtgood)  # protocols meeting both criteria
  # return a data.frame with the wanted results for this 'partial' data.frame
  data.frame(REQS = dfr$REQS[1], REPS = dfr$REPS[1],
             THRGOOD = paste(thrgood, collapse = ","),
             RTGOOD = paste(rtgood, collapse = ","),
             BOTHGOOD = paste(bothgood, collapse = ","))
}
Now you can immediately use ddply (I'm assuming your original data.frame is called orgdfr):
library(plyr)
result <- ddply(orgdfr, .(REQS, REPS), forOneSet)
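If plyr is not an option, or calling a function once per group turns out to be slow on 1.2M rows, roughly the same result can be sketched in base R with ave and aggregate. This is only an illustration on the sample data from the question. It assumes "within 10%" means at least 90% of the group's maximum throughput and at most 110% of the group's minimum response time, and the helper names (maxThr, minRt, thrOk, rtOk, picked) are made up for this example:

```r
# Sample data from the question (REQS and REPS are the input parameters).
orgdfr <- data.frame(
  PROTO = c("A", "B", "C", "A", "B", "C", "A", "B", "C"),
  REQS  = c(8, 8, 8, 16, 16, 16, 8, 8, 8),
  REPS  = c(8, 8, 8, 8, 8, 8, 16, 16, 16),
  THR   = c(10, 9.5, 7, 10, 5, 1, 8, 10, 9.5),
  RT    = c(1, 2, 1.1, 4, 1, 0.5, 1, 1.09, 1),
  stringsAsFactors = FALSE
)

# Per-group maximum throughput and minimum response time, repeated on every row.
maxThr <- ave(orgdfr$THR, orgdfr$REQS, orgdfr$REPS, FUN = max)
minRt  <- ave(orgdfr$RT,  orgdfr$REQS, orgdfr$REPS, FUN = min)

# Assumed thresholds: within 10% of the best value in the group.
thrOk <- orgdfr$THR >= 0.9 * maxThr
rtOk  <- orgdfr$RT  <= 1.1 * minRt

# Keep the protocol name where a row passes a criterion, NA otherwise.
proto <- as.character(orgdfr$PROTO)
picked <- data.frame(
  THRGOOD  = replace(proto, !thrOk, NA_character_),
  RTGOOD   = replace(proto, !rtOk, NA_character_),
  BOTHGOOD = replace(proto, !(thrOk & rtOk), NA_character_),
  stringsAsFactors = FALSE
)

# Collapse the surviving names into one comma-separated string per group.
result <- aggregate(picked, by = orgdfr[c("REQS", "REPS")],
                    FUN = function(p) paste(p[!is.na(p)], collapse = ","))
print(result)
```

On the sample data this should give the same protocol lists as the table in the question, with an empty string where the question shows "empty". Multiple matches end up in a single column simply because paste(..., collapse=",") turns them into one string.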