
Extract values from data frame in R

Using R, I would like to find out which Samples (S1, S2, S3, S4, S5) fulfill the following criterion: they contain at least one value (x, y or z) greater than 4. Thanks, Alex.

 Sample    x    y    z
     S1 -0.3  5.3  2.5
     S2  0.4  0.2 -1.2
     S3  1.2 -0.6  3.2
     S4  4.3  0.7  5.7
     S5  2.4  4.3  2.3


You could try a call to apply - for example:

> apply(dataFrameOfSamples,1,function(x)any(x > 4))
   S1    S2    S3    S4    S5
 TRUE FALSE FALSE  TRUE  TRUE
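The call above seems to presume that dataFrameOfSamples holds only the numeric columns, with the sample names stored as row names (the S1 to S5 labels in the output suggest this). A minimal sketch under that assumption follows; the way the data frame is built here is an illustration, not something the answer spells out.

```r
# Assumption: sample names are row names, so every column is numeric.
dataFrameOfSamples <- data.frame(
  x = c(-0.3, 0.4, 1.2, 4.3, 2.4),
  y = c( 5.3, 0.2, -0.6, 0.7, 4.3),
  z = c( 2.5, -1.2, 3.2, 5.7, 2.3),
  row.names = c("S1", "S2", "S3", "S4", "S5")
)

# apply() converts the data frame to a numeric matrix and visits each row;
# any() is TRUE as soon as one value in the row exceeds 4.
hits <- apply(dataFrameOfSamples, 1, function(x) any(x > 4))
print(hits)
```

If the Sample column were kept as an ordinary column instead, apply() would coerce the whole data frame to a character matrix and the comparison with 4 would no longer be numeric, so it is worth dropping or excluding that column first.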


How does this sound? Copy your data to the clipboard and run the following commands:

dta <- read.table("clipboard", header = TRUE)
apply(dta[2:4], 1, function(x) ifelse(max(x) > 4, 1, 0))
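The "clipboard" connection is platform dependent, so here is a hedged variant that reads the same table from a string through the text argument of read.table. The variable names txt and flag are made up for this sketch.

```r
# Assumption: the table is pasted into the script as a string rather than
# read from the clipboard, which makes the example reproducible anywhere.
txt <- "Sample    x    y    z
S1 -0.3  5.3  2.5
S2  0.4  0.2 -1.2
S3  1.2 -0.6  3.2
S4  4.3  0.7  5.7
S5  2.4  4.3  2.3"

dta <- read.table(text = txt, header = TRUE)

# 1 if the largest value in the row exceeds 4, otherwise 0.
flag <- apply(dta[2:4], 1, function(x) as.integer(max(x) > 4))
print(flag)
```

The row maximum exceeds 4 exactly when at least one value does, so this marks the same rows as the any() version.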


With many rows this could be more efficient:

do.call(pmax, X[c("x","y","z")]) > 4

On your data

ex <- data.frame(
  Sample = c("S1", "S2", "S3", "S4", "S5"),
  x = c(-0.3, 0.4, 1.2, 4.3, 2.4),
  y = c( 5.3, 0.2,-0.6, 0.7, 4.3),
  z = c( 2.5,-1.2, 3.2, 5.7, 2.3)
)

do.call(pmax, ex[c("x","y","z")]) > 4
# [1]  TRUE FALSE FALSE  TRUE  TRUE
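Since the question asks which samples qualify, the logical vector can be used as a row index. A small sketch, assuming the ex data frame from above; the name keep is only illustrative.

```r
ex <- data.frame(
  Sample = c("S1", "S2", "S3", "S4", "S5"),
  x = c(-0.3, 0.4, 1.2, 4.3, 2.4),
  y = c( 5.3, 0.2, -0.6, 0.7, 4.3),
  z = c( 2.5, -1.2, 3.2, 5.7, 2.3)
)

# pmax() takes the element-wise maximum across the three columns,
# i.e. the row maximum, without looping over rows.
keep <- do.call(pmax, ex[c("x", "y", "z")]) > 4

# Names of the qualifying samples, then the full matching rows.
as.character(ex$Sample[keep])
ex[keep, ]
```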


Benchmark summary: the pmax approach is not merely "possibly more efficient", as @Marek suggests; it is far more efficient than the other two. It loses some of its advantage when the data frame has more columns, but it is still the fastest approach.

Benchmark. Being an empiricist, I took the liberty of comparing the three approaches using microbenchmark. The following characteristics were varied:

  • The three approaches, here called “pmax” (by @Marek), “apply.any” (by @nullglob) and “apply.ifelse” (by @roman-luštrik)
  • Size of data frame
    • Number of rows: 10, 500, 2500
    • Number of columns: 4, 500, 1000

The pmax approach is remarkably fast. Even for the smallest data frame it is more than 3 times faster than the alternatives. For 2500 rows and 4 columns, pmax is over 97 times faster than “apply.ifelse” and 57 times faster than “apply.any”.

[Figure: bar chart "Approach Efficiency", duration by number of rows, one panel per number of columns]

The following image shows the duration of the three approaches relative to pmax. Hence, the value for pmax in each combination of rows and columns is always 1 (i.e. 100%), and the other two approaches are shown in relation to that. It shows that pmax is superior, especially for data frames with fewer columns.

Since pmax seems to lose its advantage as the number of columns increases, it could be that the other approaches become faster for a very large number of columns.

[Figure: bar chart "Relative Efficiency", duration relative to pmax by number of rows, one panel per number of columns]

Code used in this post:

library(microbenchmark)

TotalResult <- list()
for (Width in c(4, 500, 1000)) {
  for (Size in c(10, 500, 2500)) {
    ex <- data.frame(
      Sample = paste0("S", 1:Size),
      x = runif(Size, 0, 6),
      y = runif(Size, 0, 6),
      z = runif(Size, 0, 6)
    )
    if (Width > 4)
      for (i in 5:Width)
        ex[[i]] <- runif(Size, 0, 6)

    result <- microbenchmark(
      pmax = { do.call(pmax, ex[2:Width]) > 4 },
      apply.ifelse = { apply(ex[2:Width], 1, function(x) ifelse(max(x) > 4, TRUE, FALSE)) },
      apply.any = apply(ex[2:Width], 1, function(x) any(x > 4)),
      check = "identical"
    )
    cat("Benchmark: Size =", Size, "// Width =", Width, "\n")
    print(result)
    #boxplot(result)
    TotalResult <- c(TotalResult, list(list(Size=Size, Width=Width, Benchmark=result)))
  }
}


Comparison <- data.frame(Approach = character(),
                         Rows   = numeric(),
                         Columns  = numeric(),
                         Duration = double())
for(test in TotalResult) {
  x <- by(test$Benchmark$time, test$Benchmark$expr, median)
  Comparison <- rbind(Comparison,
                      data.frame(
                        Approach = unlist(attr(x, "dimnames")),
                        Rows = test$Size, Columns = test$Width, 
                        Duration = unclass(x)
                      ))
}
Comparison$Rows <- as.factor(Comparison$Rows)
Comparison$Columns <- as.factor(Comparison$Columns)
Comparison$Approach <- factor(Comparison$Approach, levels = c("pmax", "apply.any", "apply.ifelse"))

library(ggplot2)
ggplot(data=Comparison, aes(x=Rows, y=Duration, fill=Approach)) +
  geom_bar(stat="identity", position=position_dodge()) +
  facet_wrap(~ Columns, strip.position = "bottom") +
  theme(strip.placement = "outside") +
  scale_fill_brewer(palette="Paired") + 
  labs(title="Approach Efficiency", x="Size of Data Frame (top: Cols/ bottom: Rows)", y = "Duration (ns)")  # microbenchmark reports time in nanoseconds


Comparison$RefValue <- Comparison$Duration[rep(seq(1, 25, 3), each=3)]
Comparison$Relative <- Comparison$Duration / Comparison$RefValue

ggplot(data=Comparison, aes(x=Rows, y=Relative, fill=Approach)) +
  geom_bar(stat="identity", position=position_dodge()) +
  facet_wrap(~ Columns, strip.position = "bottom") +
  theme(strip.placement = "outside") +
  scale_fill_brewer(palette="Paired") +
  labs(title="Relative Efficiency", x="Size of Data Frame (Cols/Rows)", y = "Duration relative to pmax")