How to calculate the ratio of data points, i.e., combining them based on some criterion?

2023-03-31 02:08 Q&A author:

(Unfortunately, I am missing basic vocabulary to formulate my question. So, please correct me where more precise terms are useful.)

I use R to do very basic statistical analysis for benchmark results of virtual machines, and I often want to normalize my data based on some criterion.

Currently my problem is that I would like something like the following to work:

normalized_data <- ddply(bench, ~ Benchmark + Configuration + Approach,
                         transform,
                         Ratio = Time / Time[Approach == "appr2"])

So, what I actually want is to calculate the speed-up between corresponding pairs of measurements.

bench is a data frame with the columns Time, Benchmark, Configuration and Approach, and contains 100 measurements for every combination of Benchmark, Configuration and Approach. I have exactly two approaches and want the speed-up of "appr2"/"appr1". Thus, looking at one specific benchmark and one specific configuration, I have 100 measurements for "appr1" and 100 for "appr2" in my data frame. However, R gives me the following error for the query above:

Error in data.frame(list(Time = c(405.73, 342.616, 404.484, 328.742, 403.384,  : 
  arguments imply differing number of rows: 100, 0

Ideally, my query would produce a new data frame with the three columns SpeedUp, Benchmark and Configuration. Based on that I would then be able to calculate means, confidence intervals and so on.

But at the moment, the basic problem is how to express such a normalization. For another data set I was able to calculate a normalized value like this: Time.norm = Time / Time[NumCores == min(NumCores)]. It looks like that worked just by chance, though; at least, I do not understand the difference.

Any hints are appreciated. (Especially the right terminology to search for solutions to such problems.)

Edit: Thanks to Chase's hint, here is a minimal data set, which should be structurally identical to what I have, and it exhibits the same behavior with respect to the query above.

bench <- structure(list(Time = c(399.04, 388.069, 401.072, 361.646),
           Benchmark = structure(c(1L, 1L, 1L, 1L), .Label = c("Fibonacci"), class = "factor"), 
           Configuration = structure(c(1L, 1L, 1L, 1L), .Label = c("native"), class = "factor"),
           Approach = structure(c(1L, 1L, 2L, 2L), .Label = c("appr1", "appr2"), class = "factor")),
      .Names = c("Time", "Benchmark", "Configuration", "Approach"),
      row.names = c(NA, 4L), class = "data.frame")

If you try to do this within ddply in the manner I naively attempted at first, you find that you are only working within individual categories:

  ddply(bench, ~ Benchmark + Configuration + Approach,
                          transform,
                          Ratio = Time / mean(Time[Approach == "appr2"]) )
#------------
     Time Benchmark Configuration Approach     Ratio
1 399.040 Fibonacci        native    appr1       NaN
2 388.069 Fibonacci        native    appr1       NaN
3 401.072 Fibonacci        native    appr2 1.0516915
4 361.646 Fibonacci        native    appr2 0.9483085

Obviously not what was hoped for. You can calculate a mean value outside of bench to be the normalization factor:

  # mean() needs a numeric vector, so extract the Time column explicitly
  meanappr2 <- mean(subset(bench, Approach == "appr2")$Time)
  ddply(bench, ~ Benchmark + Configuration + Approach,
                          transform,
                          Ratio = Time / meanappr2 )
#--------------
     Time Benchmark Configuration Approach     Ratio
1 399.040 Fibonacci        native    appr1 1.0463631
2 388.069 Fibonacci        native    appr1 1.0175950
3 401.072 Fibonacci        native    appr2 1.0516915
4 361.646 Fibonacci        native    appr2 0.9483085

If, on the other hand, you didn't want a line-by-line normalisation but rather a cross-group comparison, use the "summarise" option within the *ply operations:

  ddply(bench, ~ Benchmark + Configuration + Approach,
                          summarise,
                          Ratio = mean(Time) / meanappr2 )
#-----------
  Benchmark Configuration Approach    Ratio
1 Fibonacci        native    appr1 1.031979
2 Fibonacci        native    appr2 1.000000
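
Note that meanappr2 above is a single mean over all "appr2" rows. That is fine for this four-row example, but on the full data set it would mix benchmarks and configurations. Below is a base R sketch of a per-group baseline, assuming the intent is to normalize each row against the mean "appr2" time of its own Benchmark and Configuration. The names baseline, BaselineTime and normalized are made up for illustration.

```r
# Minimal data set from the question, rebuilt with data.frame() for brevity.
bench <- data.frame(
  Time          = c(399.04, 388.069, 401.072, 361.646),
  Benchmark     = factor("Fibonacci"),
  Configuration = factor("native"),
  Approach      = factor(c("appr1", "appr1", "appr2", "appr2"))
)

# Mean "appr2" time per (Benchmark, Configuration) pair: the baseline.
baseline <- aggregate(Time ~ Benchmark + Configuration,
                      data = subset(bench, Approach == "appr2"),
                      FUN = mean)
names(baseline)[names(baseline) == "Time"] <- "BaselineTime"

# Attach the matching baseline to every row, then normalize against it.
normalized <- merge(bench, baseline, by = c("Benchmark", "Configuration"))
normalized$Ratio <- normalized$Time / normalized$BaselineTime
print(normalized)
```

With only one benchmark and one configuration this gives the same ratios as dividing by meanappr2; the difference only shows up once several groups are present.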

Looks like I still miss quite a number of basic concepts in R.

The solution lies in the formula used: ~ Benchmark + Configuration + Approach groups the data according to all three dimensions, and that is not what I actually need. Each resulting group really did contain only "appr1" data (or only "appr2" data), so there was nothing left to relate it to.
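
To see why the three-way grouping fails, here is a small base R sketch that mimics one of the groups ddply hands to transform. The data frame appr1_only is a hypothetical stand-in for such a group, not part of the original data.

```r
# One group as produced by grouping on Approach as well: only "appr1" rows.
appr1_only <- data.frame(Time     = c(399.04, 388.069),
                         Approach = c("appr1", "appr1"))

# Selecting the "appr2" times inside this group yields an empty vector.
denominator <- appr1_only$Time[appr1_only$Approach == "appr2"]
length(denominator)

# Dividing by an empty vector gives an empty result, so transform() cannot
# bind it to the rows of the group and signals an error.
result <- tryCatch(
  transform(appr1_only, Ratio = Time / Time[Approach == "appr2"]),
  error = function(e) conditionMessage(e)
)
print(result)

# By contrast, a length-one denominator is recycled across the whole group.
appr1_only$Time / min(appr1_only$Time)
```

The earlier NumCores example presumably worked because every group there still contained a row with the minimum NumCores, so the denominator was never empty and could be recycled.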

So, changing the formula to ~ Benchmark + Configuration results in groups that contain both the "appr1" and the "appr2" Time measurements. And then, it works as intended :)
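
For the data frame described in the question (columns SpeedUp, Benchmark and Configuration), a base R sketch might look like the following. It assumes that the i-th "appr1" measurement is paired with the i-th "appr2" measurement of the same group, which is also what the recycling in Time / Time[Approach == "appr2"] does. It also assumes that both approaches have the same number of measurements per group. speedup_data and the helper variables are illustrative names.

```r
# Minimal data set from the question, rebuilt with data.frame() for brevity.
bench <- data.frame(
  Time          = c(399.04, 388.069, 401.072, 361.646),
  Benchmark     = factor("Fibonacci"),
  Configuration = factor("native"),
  Approach      = factor(c("appr1", "appr1", "appr2", "appr2"))
)

# Group by Benchmark and Configuration only, so each group keeps both approaches.
groups <- split(bench, list(bench$Benchmark, bench$Configuration), drop = TRUE)

speedups <- lapply(groups, function(g) {
  appr1 <- g$Time[g$Approach == "appr1"]
  appr2 <- g$Time[g$Approach == "appr2"]
  stopifnot(length(appr1) == length(appr2))  # pairing assumption
  data.frame(SpeedUp       = appr1 / appr2,  # same direction as Time / Time[appr2]
             Benchmark     = g$Benchmark[1],
             Configuration = g$Configuration[1])
})

speedup_data <- do.call(rbind, speedups)
rownames(speedup_data) <- NULL
print(speedup_data)
```

From speedup_data, per-group means can then be taken with, for example, aggregate(SpeedUp ~ Benchmark + Configuration, data = speedup_data, FUN = mean).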
