Subset a data frame based on column entry (or rank)
I have a data.frame as simple as this one:
id group idu value
1 1 1_1 34
2 1 2_1 23
3 1 3_1 67
4 2 4_2 6
5 2 5_2 24
6 2 6_2 45
1 3 1_3 34
2 3 2_3 67
3 3 3_3 76
from which I want to retrieve a subset containing the first entry of each group; something like:
id group idu value
1 1 1_1 34
4 2 4_2 6
1 3 1_3 34
id is not unique so the approach should not rely on it.
Can I achieve this without using loops?
dput() of the data:
structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L), group = c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), idu = structure(c(1L, 3L, 5L,
7L, 8L, 9L, 2L, 4L, 6L), .Label = c("1_1", "1_3", "2_1", "2_3",
"3_1", "3_3", "4_2", "5_2", "6_2"), class = "factor"), value = c(34L,
23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L)), .Names = c("id", "group",
"idu", "value"), class = "data.frame", row.names = c(NA, -9L))
Using Gavin's million row df:
DF3 <- data.frame(id = sample(1000, 1000000, replace = TRUE),
                  group = factor(rep(1:1000, each = 1000)),
                  value = runif(1000000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))
I think the fastest way is to reorder the data frame and then use duplicated():
system.time({
    DF4 <- DF3[order(DF3$group), ]        # sort rows by group
    out2 <- DF4[!duplicated(DF4$group), ] # keep the first row of each group
})
# user system elapsed
# 0.335 0.107 0.441
This compares to 7 seconds for Gavin's fastest lapply + split method on my computer.
When working with data frames, the fastest approach is usually to generate all the indices and then do a single subset.
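For instance, here is a minimal sketch of that indices-first idea (my own illustration, not code from the answer above; idx and out.idx are made-up names, and it assumes the DF3 created above, with group stored as a factor). match() returns the position of the first row matching each group level, so the whole result comes from a single subset:
## match() gives the index of the first occurrence of each group level;
## build all the indices at once, then subset in one call (no per-group loop)
idx <- match(levels(DF3$group), DF3$group)
out.idx <- DF3[idx, ]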
Update in light of OP's comment
If you are doing this on a million+ rows, all of the options supplied thus far will be slow. Here are some comparison timings on a dummy data set of 100,000 rows:
set.seed(12)
DF3 <- data.frame(id = sample(1000, 100000, replace = TRUE),
                  group = factor(rep(1:100, each = 1000)),
                  value = runif(100000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))
> system.time(out1 <- do.call(rbind, lapply(split(DF3, DF3["group"]), `[`, 1, )))
user system elapsed
19.594 0.053 19.984
> system.time(out3 <- aggregate(DF3[,-2], DF3["group"], function (x) x[1]))
user system elapsed
12.419 0.141 12.788
I gave up running them on a million rows. Far faster, believe it or not, is:
## drop the factor column idu (column 4) so all remaining columns are
## numeric, take the first row of each group, and pack the results into
## a matrix, one group per row
out2 <- matrix(unlist(lapply(split(DF3[, -4], DF3["group"]), `[`, 1, )),
               byrow = TRUE, nrow = (lev <- length(levels(DF3$group))))
colnames(out2) <- names(DF3)[-4]
rownames(out2) <- seq_len(lev)
## convert back to a data frame and rebuild the factor columns
out2 <- as.data.frame(out2)
out2$group <- factor(out2$group)
out2$idu <- factor(paste(out2$id, out2$group, sep = "_"),
                   levels = levels(DF3$idu))
The outputs are (effectively) the same:
> all.equal(out1, out2)
[1] TRUE
> all.equal(out1, out3[, c(2,1,3,4)])
[1] "Attributes: < Component 2: Modes: character, numeric >"
[2] "Attributes: < Component 2: target is character, current is numeric >"
(The difference between out1 (or out2) and out3 (the aggregate() version) is just in the row names of the components.)
with a timing of:
user system elapsed
0.163 0.001 0.168
on the 100,000 row problem, and on this million row problem:
set.seed(12)
DF3 <- data.frame(id = sample(1000, 1000000, replace = TRUE),
                  group = factor(rep(1:1000, each = 1000)),
                  value = runif(1000000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))
with a timing of
user system elapsed
11.916 0.000 11.925
Working with the matrix version (the one that produces out2) is quicker on the million rows than the other versions are on the 100,000-row problem. This just shows that working with matrices is very quick indeed, and the bottleneck in my do.call() version is rbind()-ing the result together.
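To see that bottleneck in isolation (a sketch of my own, not part of the original answer; pieces and out.rb are illustrative names), time the rbind() step separately from the per-group subsetting:
## the split + lapply stage that extracts one row per group is comparatively cheap
pieces <- lapply(split(DF3, DF3["group"]), `[`, 1, )
## rbind()-ing the 1000 one-row data frames is where most of the time goes
system.time(out.rb <- do.call(rbind, pieces))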
The million row problem timing was done with:
system.time({
    out4 <- matrix(unlist(lapply(split(DF3[, -4], DF3["group"]), `[`, 1, )),
                   byrow = TRUE,
                   nrow = (lev <- length(levels(DF3$group))))
    colnames(out4) <- names(DF3)[-4]
    rownames(out4) <- seq_len(lev)
    out4 <- as.data.frame(out4)
    out4$group <- factor(out4$group)
    out4$idu <- factor(paste(out4$id, out4$group, sep = "_"),
                       levels = levels(DF3$idu))
})
Original
If your data are in DF, say, then:
do.call(rbind, lapply(with(DF, split(DF, group)), head, 1))
will do what you want:
> do.call(rbind, lapply(with(DF, split(DF, group)), head, 1))
idu group
1 1 1
2 4 2
3 7 3
If the new data are in DF2, then we get:
> do.call(rbind, lapply(with(DF2, split(DF2, group)), head, 1))
id group idu value
1 1 1 1_1 34
2 4 2 4_2 6
3 1 3 1_3 34
But for speed, we probably want to subset instead of using head(), and we can gain a bit by not using with(), e.g.:
do.call(rbind, lapply(split(DF2, DF2$group), `[`, 1, ))
> system.time(replicate(1000, do.call(rbind, lapply(split(DF2, DF2$group), `[`, 1, ))))
user system elapsed
3.847 0.040 4.044
> system.time(replicate(1000, do.call(rbind, lapply(split(DF2, DF2$group), head, 1))))
user system elapsed
4.058 0.038 4.111
> system.time(replicate(1000, aggregate(DF2[,-2], DF2["group"], function (x) x[1])))
user system elapsed
3.902 0.042 4.106
One solution using plyr, assuming your data is in an object named zzz:
library(plyr)
ddply(zzz, "group", function(x) x[1, ])
Another option takes differences between successive rows and should prove faster, but it relies on the object being ordered beforehand. It also assumes you don't have a group value of 0:
zzz <- zzz[order(zzz$group) ,]
zzz[ diff(c(0,zzz$group)) != 0, ]
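If a group value of 0 cannot be ruled out, a small variant (my sketch, not the original answer's code) compares each row's group with the one above it instead, which works for any group values, including factors:
zzz <- zzz[order(zzz$group), ]
## TRUE for the first row and wherever group differs from the previous row
first <- c(TRUE, zzz$group[-1] != zzz$group[-nrow(zzz)])
zzz[first, ]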
I think this will do the trick:
aggregate(data["idu"], data["group"], function (x) x[1])
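Note that this returns only the idu column. To keep every column, the same pattern can be applied to the whole frame minus the grouping column (a small extension in the spirit of the aggregate() timing above, assuming group is column 2 as in the example data):
aggregate(data[, -2], data["group"], function(x) x[1])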
For your updated question, I'd recommend using ddply from the plyr package:
library(plyr)
ddply(data, .(group), function(x) x[1, ])