Extract values from data frame in R
Using R, I would like to find out which Samples (S1, S2, S3, S4, S5) fulfill the following criterion: they contain at least one value (x, y or z) bigger than 4. Thanks, Alex.
Sample     x     y     z
S1      -0.3   5.3   2.5
S2       0.4   0.2  -1.2
S3       1.2  -0.6   3.2
S4       4.3   0.7   5.7
S5       2.4   4.3   2.3
You could try a call to apply, for example:
> apply(dataFrameOfSamples,1,function(x)any(x > 4))
   S1    S2    S3    S4    S5
 TRUE FALSE FALSE  TRUE  TRUE
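The named output above suggests that the sample labels were stored as row names, so that only numeric columns reach apply. If Sample were an ordinary column, apply would convert every row to character and the comparison with 4 would no longer be numeric. A minimal sketch under that assumption (the construction of dataFrameOfSamples here is mine, built from the question's table):

```r
# Assumption: sample labels are row names, so all columns are numeric.
dataFrameOfSamples <- data.frame(
  x = c(-0.3, 0.4, 1.2, 4.3, 2.4),
  y = c( 5.3, 0.2, -0.6, 0.7, 4.3),
  z = c( 2.5, -1.2, 3.2, 5.7, 2.3),
  row.names = c("S1", "S2", "S3", "S4", "S5")
)

# One logical per row: TRUE if any of x, y, z is strictly greater than 4.
# apply() carries the row names over to the result.
hits <- apply(dataFrameOfSamples, 1, function(row) any(row > 4))
hits
```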
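How does this sound? Copy your data to the clipboard and run the following commands (the "clipboard" connection is platform-dependent, so this may need adjusting on your system):
dta <- read.table("clipboard", header = TRUE)
# 1 if the row maximum is strictly greater than 4, otherwise 0
apply(dta[2:4], 1, function(x) ifelse(max(x) > 4, 1, 0))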
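If the clipboard is not convenient, the same idea can be tried with the table passed as an inline string through the text argument of read.table. This is only a sketch: the names flag and selected are mine, and it returns a logical vector rather than 1/0 so it can be used directly for indexing.

```r
# Parse the question's table from an inline string instead of the clipboard.
dta <- read.table(text = "Sample x y z
S1 -0.3 5.3 2.5
S2 0.4 0.2 -1.2
S3 1.2 -0.6 3.2
S4 4.3 0.7 5.7
S5 2.4 4.3 2.3", header = TRUE)

# TRUE where the row maximum of columns 2 to 4 is strictly greater than 4.
flag <- apply(dta[2:4], 1, function(row) max(row) > 4)

# Samples meeting the criterion, as plain character strings.
selected <- as.character(dta$Sample[flag])
selected
```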
With many rows this could be more efficient (X is your data frame):
do.call(pmax, X[c("x","y","z")]) > 4
On your data:
ex <- data.frame(
  Sample = c("S1", "S2", "S3", "S4", "S5"),
  x = c(-0.3, 0.4, 1.2, 4.3, 2.4),
  y = c( 5.3, 0.2,-0.6, 0.7, 4.3),
  z = c( 2.5,-1.2, 3.2, 5.7, 2.3)
)
do.call(pmax, ex[c("x","y","z")]) > 4
# [1] TRUE FALSE FALSE TRUE TRUE
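Since the question asks which samples qualify, the logical vector can be used to index the Sample column. The sketch below also shows a rowSums variant, which is a different technique from pmax and is included only as a cross-check. The names cols, above and above2 are mine.

```r
ex <- data.frame(
  Sample = c("S1", "S2", "S3", "S4", "S5"),
  x = c(-0.3, 0.4, 1.2, 4.3, 2.4),
  y = c( 5.3, 0.2, -0.6, 0.7, 4.3),
  z = c( 2.5, -1.2, 3.2, 5.7, 2.3)
)
cols <- c("x", "y", "z")

# Row-wise maximum across the three columns, compared with 4.
above <- do.call(pmax, ex[cols]) > 4

# Labels of the samples that have at least one value above 4.
matching <- as.character(ex$Sample[above])
matching

# Cross-check: count values above 4 per row and test for at least one.
above2 <- unname(rowSums(ex[cols] > 4) > 0)
```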
Benchmark summary: the pmax approach is not just somewhat more efficient, as @Marek suggests; it is a lot more efficient than the other two. It loses some of its advantage when a data frame has more columns, but it is still the fastest approach.
Benchmark. Being an empiricist, I took the liberty of comparing the three approaches using microbenchmark. These characteristics were varied:
- The three approaches here called “pmax” (by @Marek), “apply.any” (by @nullglob), “apply.ifelse” (by @roman-luštrik)
- Size of data frame
- Number of rows: 10, 500, 2500
- Number of columns: 4, 500, 1000
The performance of the pmax approach is remarkable: it is a lot faster. For the smallest data frame it is more than 3 times faster. For 2500 rows and 4 columns, pmax is over 97 times faster than "apply.ifelse" and 57 times faster than "apply.any".
The following images show the performance of the three approaches relative to pmax. Hence, the value for pmax in each combination of rows and columns is always 1 (i.e. 100%) and the other two approaches are shown in relation to that. They show that pmax is superior, especially for data frames with fewer columns.
Since pmax seems to lose its advantage as the number of columns increases, it could be that the other approaches become faster for a sufficiently large number of columns.
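One plausible explanation for the gap (my reading, not something the benchmark measures directly) is that pmax works on whole columns in vectorised code, whereas apply converts the data frame to a matrix and then calls an R function once per row. Before comparing speed, it is worth confirming that the three approaches return the same answer. A small sketch with random data of the same shape as in the benchmark:

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 2500
ex <- data.frame(
  Sample = paste0("S", seq_len(n)),
  x = runif(n, 0, 6),
  y = runif(n, 0, 6),
  z = runif(n, 0, 6)
)
cols <- c("x", "y", "z")

# Vectorised: one pass over whole columns.
res_pmax <- do.call(pmax, ex[cols]) > 4

# Row-wise: one R function call per row.
res_any    <- unname(apply(ex[cols], 1, function(row) any(row > 4)))
res_ifelse <- unname(apply(ex[cols], 1, function(row) ifelse(max(row) > 4, TRUE, FALSE)))

# All three approaches should agree element by element.
stopifnot(identical(res_pmax, res_any), identical(res_pmax, res_ifelse))
```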
Code used in this post:
library(microbenchmark)
TotalResult <- list()
# Loop over every combination of column count (Width) and row count (Size)
for (Width in c(4, 500, 1000)) {
  for (Size in c(10, 500, 2500)) {
    ex <- data.frame(
      Sample = paste0("S", 1:Size),
      x = runif(Size, 0, 6),
      y = runif(Size, 0, 6),
      z = runif(Size, 0, 6)
    )
    # Pad with extra random columns until the data frame has Width columns
    if (Width > 4)
      for (i in 5:Width)
        ex[[i]] <- runif(Size, 0, 6)
    result <- microbenchmark(
      pmax = { do.call(pmax, ex[2:Width]) > 4 },
      apply.ifelse = { apply(ex[2:Width], 1, function(x) ifelse(max(x) > 4, TRUE, FALSE)) },
      apply.any = apply(ex[2:Width], 1, function(x) any(x > 4)),
      check = "identical"  # all three expressions must return identical results
    )
    cat("Benchmark: Size =", Size, "// Width =", Width, "\n")
    print(result)
    #boxplot(result)
    TotalResult <- c(TotalResult, list(list(Size=Size, Width=Width, Benchmark=result)))
  }
}
# Collect the median time per approach for each Size/Width combination
Comparison <- data.frame(Approach = character(),
                         Rows = numeric(),
                         Columns = numeric(),
                         Duration = double())
for(test in TotalResult) {
  x <- by(test$Benchmark$time, test$Benchmark$expr, median)
  Comparison <- rbind(Comparison,
                      data.frame(
                        Approach = unlist(attr(x, "dimnames")),
                        Rows = test$Size, Columns = test$Width,
                        Duration = unclass(x)
                      ))
}
Comparison$Rows <- as.factor(Comparison$Rows)
Comparison$Columns <- as.factor(Comparison$Columns)
Comparison$Approach <- factor(Comparison$Approach, levels = c("pmax", "apply.any", "apply.ifelse"))
library(ggplot2)
# microbenchmark stores raw timings in nanoseconds, so Duration is in ns
ggplot(data=Comparison, aes(x=Rows, y=Duration, fill=Approach)) +
  geom_bar(stat="identity", position=position_dodge()) +
  facet_wrap(~ Columns, strip.position = "bottom") +
  theme(strip.placement = "outside") +
  scale_fill_brewer(palette="Paired") +
  labs(title="Approach Efficiency", x="Size of Data Frame (top: Cols/ bottom: Rows)", y = "Duration (ns)")
Comparison$RefValue <- Comparison$Duration[rep(seq(1, 25, 3), each=3)]
Comparison$Relative <- Comparison$Duration / Comparison$RefValue
ggplot(data=Comparison, aes(x=Rows, y=Relative, fill=Approach)) +
  geom_bar(stat="identity", position=position_dodge()) +
  facet_wrap(~ Columns, strip.position = "bottom") +
  theme(strip.placement = "outside") +
  scale_fill_brewer(palette="Paired") +
  labs(title="Relative Efficiency", x="Size of Data Frame (Cols/Rows)", y = "Duration relative to pmax")