How to deal with data matching specific criteria in a large dataset in R

2023-03-31 23:03

I have a large data set in R (1.2M records) containing readings for different protocols. I would like to classify this data (which I can do with rpart/RWeka), but I first need to preprocess it, and that is what this question is about.

The data set consists of a pair of outputs (throughput, response time) per set of control parameters, for 4 different protocols. I would like to "bin" these values: for each set of control parameters, choose only those protocols that are within 10% of the maximum throughput (for that set of input parameters) and within 10% of the minimum response time.

I know I can use aggregate to find the maximum throughput and minimum response time in another data.frame, and then join it with the original data.frame. Then I can use ifelse to find the protocol names matching the criteria. However, that seems inefficient to me, and I don't know how I would encode multiple matches (per set of input values) in a single column.
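For reference, the aggregate-and-join route described above could be sketched roughly as follows in base R. This is only an illustration: the rows are the sample data from the example further down, the 0.9 and 1.1 factors are an assumed reading of "within 10%", and the names dat, limits, flagged, THRMAX, RTMIN, THROK and RTOK are made up for this sketch.

```r
# Sample rows from the example in this post
dat <- data.frame(
  PROTO = rep(c("A", "B", "C"), times = 3),
  REQS  = rep(c(8, 16, 8), each = 3),
  REPS  = rep(c(8, 8, 16), each = 3),
  THR   = c(10, 9.5, 7, 10, 5, 1, 8, 10, 9.5),
  RT    = c(1, 2, 1.1, 4, 1, 0.5, 1, 1.09, 1),
  stringsAsFactors = FALSE
)

# Per-group maximum throughput and minimum response time
maxthr <- aggregate(THR ~ REQS + REPS, data = dat, FUN = max)
minrt  <- aggregate(RT ~ REQS + REPS, data = dat, FUN = min)
names(maxthr)[3] <- "THRMAX"
names(minrt)[3]  <- "RTMIN"

# Join the per-group limits back onto every original row
limits  <- merge(maxthr, minrt, by = c("REQS", "REPS"))
flagged <- merge(dat, limits, by = c("REQS", "REPS"))

# Assumption: "within 10%" means >= 90% of the max and <= 110% of the min
flagged$THROK <- flagged$THR >= 0.9 * flagged$THRMAX
flagged$RTOK  <- flagged$RT  <= 1.1 * flagged$RTMIN
print(flagged)
```

This yields one logical flag per row and criterion. Collapsing the matching PROTO values per group into a single string (for instance with paste and collapse = ",") would still be a separate step.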

Any suggestions?

Example (REQS and REPS are input parameters):

PROTO  REQS  REPS  THR  RT
A      8     8     10   1
B      8     8     9.5  2
C      8     8     7    1.1
A      16    8     10   4
B      16    8     5    1
C      16    8     1    0.5
A      8     16    8    1
B      8     16    10   1.09
C      8     16    9.5  1

Should produce something like:

REQS REPS THRGOOD RTGOOD BOTHGOOD
8    8    A,B     A,C    A
16   8    A       C      empty
8    16   B,C     A,B,C  B,C

ddply from the plyr package should be your friend here.

First, write a function that gives you the desired result for a data.frame containing only the rows for one set of input parameters:

forOneSet <- function(dfr)
{
  # "within 10%" is read here as: at least 90% of the maximum throughput and
  # at most 110% of the minimum response time - adapt if you mean something else
  THRlim <- 0.9 * max(dfr$THR)
  RTlim <- 1.1 * min(dfr$RT)
  thrgood <- as.character(dfr$PROTO[dfr$THR >= THRlim])
  rtgood <- as.character(dfr$PROTO[dfr$RT <= RTlim])
  # protocols that satisfy both criteria (empty string if there are none)
  bothgood <- intersect(thrgood, rtgood)
  # return a one-row data.frame with the wanted results for this 'partial' data.frame
  data.frame(REQS=dfr$REQS[1], REPS=dfr$REPS[1], THRGOOD=paste(thrgood, collapse=","), RTGOOD=paste(rtgood, collapse=","), BOTHGOOD=paste(bothgood, collapse=","))
}

Now you can immediately use ddply (I'm assuming your original data.frame is called orgdfr):

library(plyr)
result <- ddply(orgdfr, .(REQS, REPS), forOneSet)
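To check the idea against the example in the question without installing anything, here is a minimal base-R sketch of the same split-apply-combine pattern, with split/lapply/rbind standing in for ddply. The thresholds are an assumed reading of "within 10%" (at least 90% of the group's maximum throughput, at most 110% of its minimum response time), and dat, summarise_set, groups and res are names invented for this illustration.

```r
# Sample rows from the question
dat <- data.frame(
  PROTO = rep(c("A", "B", "C"), times = 3),
  REQS  = rep(c(8, 16, 8), each = 3),
  REPS  = rep(c(8, 8, 16), each = 3),
  THR   = c(10, 9.5, 7, 10, 5, 1, 8, 10, 9.5),
  RT    = c(1, 2, 1.1, 4, 1, 0.5, 1, 1.09, 1),
  stringsAsFactors = FALSE
)

# Summarise one set of input parameters into a single row
summarise_set <- function(d) {
  # Assumption: THR >= 90% of group max, RT <= 110% of group min
  thrgood  <- d$PROTO[d$THR >= 0.9 * max(d$THR)]
  rtgood   <- d$PROTO[d$RT  <= 1.1 * min(d$RT)]
  bothgood <- intersect(thrgood, rtgood)
  data.frame(
    REQS     = d$REQS[1],
    REPS     = d$REPS[1],
    THRGOOD  = paste(thrgood,  collapse = ","),
    RTGOOD   = paste(rtgood,   collapse = ","),
    BOTHGOOD = paste(bothgood, collapse = ","),
    stringsAsFactors = FALSE
  )
}

# Split by the input parameters, summarise each piece, bind the rows together
groups <- split(dat, list(dat$REQS, dat$REPS), drop = TRUE)
res <- do.call(rbind, lapply(groups, summarise_set))
rownames(res) <- NULL
print(res)
```

On the sample data this gives one row per (REQS, REPS) combination, with an empty string in BOTHGOOD where no protocol meets both criteria. How well it scales to 1.2M records has not been measured here.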
