Extract values from data frame in R

2023-01-10 14:41

Using R, I would like to find out which Samples (S1, S2, S3, S4, S5) fulfill the following criterion: they contain at least one value (x, y or z) bigger than 4. Thanks, Alex.

 Sample    x    y    z
     S1 -0.3  5.3  2.5
     S2  0.4  0.2 -1.2
     S3  1.2 -0.6  3.2
     S4  4.3  0.7  5.7
     S5  2.4  4.3  2.3

You could try a call to apply, for example (with the sample names as row names, so that every column is numeric):

> apply(dataFrameOfSamples,1,function(x)any(x > 4))
   S1    S2    S3    S4    S5
 TRUE FALSE FALSE  TRUE  TRUE
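
The call above compares every cell to 4, so it only behaves as intended when all columns are numeric. A minimal sketch under that assumption, with the sample names stored as row names (the way the data frame is built here is my own guess, only the name dataFrameOfSamples comes from the answer), which also returns the names of the matching samples:

```r
# Data from the question, with the sample names as row names (an assumption)
dataFrameOfSamples <- data.frame(
  x = c(-0.3, 0.4, 1.2, 4.3, 2.4),
  y = c( 5.3, 0.2, -0.6, 0.7, 4.3),
  z = c( 2.5, -1.2, 3.2, 5.7, 2.3),
  row.names = c("S1", "S2", "S3", "S4", "S5")
)

# TRUE for every row in which at least one value exceeds 4
hit <- apply(dataFrameOfSamples, 1, function(x) any(x > 4))

# Names of the samples that fulfil the criterion
names(hit)[hit]
# [1] "S1" "S4" "S5"
```

If Sample were an ordinary character column instead, apply would convert the whole data frame to a character matrix and the comparison would no longer be numeric.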

How does this sound? Copy your data into your clipboard and execute the following commands:

dta <- read.table("clipboard", header = TRUE)
# strict > 4 ("bigger than 4"); returns 1 for a match and 0 otherwise
apply(dta[2:4], 1, function(x) ifelse(max(x) > 4, 1, 0))
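
Reading from the clipboard depends on the platform, so here is a sketch of the same idea that reads the table from a string instead. The variable names txt and flag are mine, and the comparison is a strict > 4 to match "bigger than 4":

```r
# The question's table as text, so no clipboard is needed
txt <- "Sample    x    y    z
S1 -0.3  5.3  2.5
S2  0.4  0.2 -1.2
S3  1.2 -0.6  3.2
S4  4.3  0.7  5.7
S5  2.4  4.3  2.3"

dta <- read.table(text = txt, header = TRUE)

# 1 if the row maximum exceeds 4, otherwise 0 (columns 2 to 4 are x, y, z)
flag <- apply(dta[2:4], 1, function(x) ifelse(max(x) > 4, 1, 0))

# Samples that fulfil the criterion
as.character(dta$Sample[flag == 1])
# [1] "S1" "S4" "S5"
```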

With many rows this could be more efficient (X is your data frame):

do.call(pmax, X[c("x","y","z")]) > 4

On your data

ex <- data.frame(
  Sample = c("S1", "S2", "S3", "S4", "S5"),
  x = c(-0.3, 0.4, 1.2, 4.3, 2.4),
  y = c( 5.3, 0.2,-0.6, 0.7, 4.3),
  z = c( 2.5,-1.2, 3.2, 5.7, 2.3)
)

do.call(pmax, ex[c("x","y","z")]) > 4
# [1]  TRUE FALSE FALSE  TRUE  TRUE
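
In case the idiom is unfamiliar: pmax returns the element-wise maximum of its arguments, and do.call passes each selected column as a separate argument, so the call should be equivalent to pmax(ex$x, ex$y, ex$z). A small sketch on the same data (row_max is a name I made up) that also pulls out the sample names:

```r
ex <- data.frame(
  Sample = c("S1", "S2", "S3", "S4", "S5"),
  x = c(-0.3, 0.4, 1.2, 4.3, 2.4),
  y = c( 5.3, 0.2, -0.6, 0.7, 4.3),
  z = c( 2.5, -1.2, 3.2, 5.7, 2.3)
)

# Row-wise maximum, computed column by column in vectorised form
row_max <- do.call(pmax, ex[c("x", "y", "z")])
row_max
# [1] 5.3 0.4 3.2 5.7 4.3

# Same result as spelling the columns out by hand
identical(row_max, pmax(ex$x, ex$y, ex$z))
# [1] TRUE

# Samples with at least one value bigger than 4
as.character(ex$Sample[row_max > 4])
# [1] "S1" "S4" "S5"
```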

Benchmark summary: the pmax approach is not just possibly more efficient, as @Marek suggests; it is a lot more efficient than the other two. It loses some of its advantage when a data frame has more columns, but it is still the fastest approach.

Benchmark. Being an empiricist, I took the liberty of comparing the three approaches using microbenchmark. The following characteristics were varied:

Approach: “pmax” (by @Marek), “apply.any” (by @nullglob), “apply.ifelse” (by @roman-luštrik)
Size of data frame:
- Number of rows: 10, 500, 2500
- Number of columns: 4, 500, 1000

The performance of the pmax approach is remarkable. Even for the smallest data frame it is faster by a factor of more than 3. For 2500 rows and 4 columns, pmax is over 97 times faster than “apply.ifelse” and 57 times faster than “apply.any”.

[Image: benchmark durations of the three approaches]

The following image shows the durations of the three approaches relative to pmax. Hence, the value for pmax in each combination of rows and columns is always 1 (i.e. 100%), and the other two approaches are shown in relation to that. It shows that pmax is superior, especially for the data frames with fewer columns.

Since pmax seems to lose its advantage with an increasing number of columns, it could be that the other approaches become faster with a large number of columns.

[Image: durations relative to pmax]

Code used in this post:

library(microbenchmark)

TotalResult <- list()
for (Width in c(4, 500, 1000)) {
  for (Size in c(10, 500, 2500)) {
    ex <- data.frame(
      Sample = paste0("S", 1:Size),
      x = runif(Size, 0, 6),
      y = runif(Size, 0, 6),
      z = runif(Size, 0, 6)
    )
    if (Width > 4)
      for (i in 5:Width)
        ex[[i]] <- runif(Size, 0, 6)

    result <- microbenchmark(
      pmax = { do.call(pmax, ex[2:Width]) > 4 },
      apply.ifelse = { apply(ex[2:Width], 1, function(x) ifelse(max(x) > 4, TRUE, FALSE)) },
      apply.any = apply(ex[2:Width], 1, function(x) any(x > 4)),
      check = "identical"
    )
    cat("Benchmark: Size =", Size, "// Width =", Width, "\n")
    print(result)
    #boxplot(result)
    TotalResult <- c(TotalResult, list(list(Size=Size, Width=Width, Benchmark=result)))
  }
}


Comparison <- data.frame(Approach = character(),
                         Rows   = numeric(),
                         Columns  = numeric(),
                         Duration = double())
for(test in TotalResult) {
  x <- by(test$Benchmark$time, test$Benchmark$expr, median)
  Comparison <- rbind(Comparison,
                      data.frame(
                        Approach = unlist(attr(x, "dimnames")),
                        Rows = test$Size, Columns = test$Width, 
                        Duration = unclass(x)
                      ))
}
Comparison$Rows <- as.factor(Comparison$Rows)
Comparison$Columns <- as.factor(Comparison$Columns)
Comparison$Approach <- factor(Comparison$Approach, levels = c("pmax", "apply.any", "apply.ifelse"))

library(ggplot2)
ggplot(data=Comparison, aes(x=Rows, y=Duration, fill=Approach)) +
  geom_bar(stat="identity", position=position_dodge()) +
  facet_wrap(~ Columns, strip.position = "bottom") +
  theme(strip.placement = "outside") +
  scale_fill_brewer(palette="Paired") + 
  labs(title="Approach Efficiency", x="Size of Data Frame (top: Cols/ bottom: Rows)", y = "Median duration (ns)")


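# Reference value: the pmax duration, i.e. the first of the 3 rows of each
# of the 9 size combinations (rows 1, 4, ..., 25), repeated for all 3 approaches
Comparison$RefValue <- Comparison$Duration[rep(seq(1, 25, 3), each=3)]
Comparison$Relative <- Comparison$Duration / Comparison$RefValue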

ggplot(data=Comparison, aes(x=Rows, y=Relative, fill=Approach)) +
  geom_bar(stat="identity", position=position_dodge()) +
  facet_wrap(~ Columns, strip.position = "bottom") +
  theme(strip.placement = "outside") +
  scale_fill_brewer(palette="Paired") +
  labs(title="Relative Efficiency", x="Size of Data Frame (Cols/Rows)", y = "Duration relative to pmax")
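
The reference value above is picked by row position, which only works as long as pmax is the first row of every group. A sketch of an alternative that looks the pmax duration up by key instead, shown on made-up durations (the numbers are purely illustrative, not measured results, and ref is a hypothetical helper name):

```r
# Made-up median durations, for illustration only (not measured results)
Comparison <- data.frame(
  Approach = rep(c("pmax", "apply.ifelse", "apply.any"), times = 2),
  Rows     = rep(c(10, 500), each = 3),
  Columns  = 4,
  Duration = c(10, 40, 30, 20, 400, 300)
)

# Keep one pmax row per Rows/Columns combination as the reference
ref <- Comparison[Comparison$Approach == "pmax", c("Rows", "Columns", "Duration")]
names(ref)[3] <- "RefValue"

# Attach the reference to every row of the same combination, then divide
Comparison <- merge(Comparison, ref, by = c("Rows", "Columns"))
Comparison$Relative <- Comparison$Duration / Comparison$RefValue
```

With this lookup, the order of the expressions in the microbenchmark call and the number of size combinations no longer matter.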
