
Subset a data frame based on column entry (or rank)

I have a data.frame as simple as this one:

id group idu  value
1  1     1_1  34
2  1     2_1  23
3  1     3_1  67
4  2     4_2  6
5  2     5_2  24
6  2     6_2  45
1  3     1_3  34
2  3     2_3  67
3  3     3_3  76

from which I want to retrieve a subset containing the first entry of each group; something like:

id group idu value
1  1     1_1 34
4  2     4_2 6
1  3     1_3 34

id is not unique so the approach should not rely on it.

Can I achieve this avoiding loops?

dput() of data:

structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L), group = c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), idu = structure(c(1L, 3L, 5L, 
7L, 8L, 9L, 2L, 4L, 6L), .Label = c("1_1", "1_3", "2_1", "2_3", 
"3_1", "3_3", "4_2", "5_2", "6_2"), class = "factor"), value = c(34L, 
23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L)), .Names = c("id", "group", 
"idu", "value"), class = "data.frame", row.names = c(NA, -9L))


Using Gavin's million row df:

DF3 <- data.frame(id = sample(1000, 1000000, replace = TRUE),
                  group = factor(rep(1:1000, each = 1000)),
                  value = runif(1000000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))

I think the fastest way is to reorder the data frame and then use duplicated:

system.time({
  DF4 <- DF3[order(DF3$group), ]
  out2 <- DF4[!duplicated(DF4$group), ]
})
# user  system elapsed 
# 0.335   0.107   0.441

This compares to 7 seconds for Gavin's fastest lapply + split method on my computer.
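If the rows already arrive grouped by group (as in this data), the order() step can be skipped entirely; a minimal variant (out2b is a name introduced here for illustration):

out2b <- DF3[!duplicated(DF3$group), ]  # keeps the first row of each group in its current order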

Generally, when working with data frames, the fastest approach is usually to generate all the indices and then do a single subset.
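A minimal sketch of that idea using match(), which returns the position of the first occurrence of each value (idx and out5 are names introduced here for illustration):

## build all the first-occurrence indices in one go, then do a single subset
idx <- match(levels(DF3$group), DF3$group)
out5 <- DF3[idx, ]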


Update in light of OP's comment

If doing this on a million+ rows, all the options supplied so far will be slow. Here are some comparison timings on a dummy data set of 100,000 rows:

set.seed(12)
DF3 <- data.frame(id = sample(1000, 100000, replace = TRUE),
                  group = factor(rep(1:100, each = 1000)),
                  value = runif(100000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))

> system.time(out1 <- do.call(rbind, lapply(split(DF3, DF3["group"]), `[`, 1, )))
   user  system elapsed 
 19.594   0.053  19.984 
> system.time(out3 <- aggregate(DF3[,-2], DF3["group"], function (x) x[1]))
   user  system elapsed 
 12.419   0.141  12.788 

I gave up doing them with a million rows. Far faster, believe it or not, is:

## take the first row of each group (dropping idu), flatten to a vector,
## and refill it as a matrix with one row per group
out2 <- matrix(unlist(lapply(split(DF3[, -4], DF3["group"]), `[`, 1,)),
               byrow = TRUE, nrow = (lev <- length(levels(DF3$group))))
colnames(out2) <- names(DF3)[-4]
rownames(out2) <- seq_len(lev)
out2 <- as.data.frame(out2)
## everything is numeric at this point, so rebuild the factor columns
out2$group <- factor(out2$group)
out2$idu <- factor(paste(out2$id, out2$group, sep = "_"),
                   levels = levels(DF3$idu))

The outputs are (effectively) the same:

> all.equal(out1, out2)
[1] TRUE
> all.equal(out1, out3[, c(2,1,3,4)])
[1] "Attributes: < Component 2: Modes: character, numeric >"              
[2] "Attributes: < Component 2: target is character, current is numeric >"

(the difference between out1 (or out2) and out3 (the aggregate() version) is just in the rownames of the components.)

with a timing of:

   user  system elapsed 
  0.163   0.001   0.168

on the 100,000 row problem, and on this million row problem:

set.seed(12)
DF3 <- data.frame(id = sample(1000, 1000000, replace = TRUE),
                  group = factor(rep(1:1000, each = 1000)),
                  value = runif(1000000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))

with a timing of

   user  system elapsed 
 11.916   0.000  11.925

The matrix version (the one that produces out2) is quicker at doing the million rows than the other versions are at doing the 100,000-row problem. This just shows that working with matrices is very quick indeed, and that the bottleneck in my do.call() version is rbind()-ing the results together.

The million row problem timing was done with:

system.time({out4 <- matrix(unlist(lapply(split(DF3[, -4], DF3["group"]),
                                          `[`, 1,)),
                            byrow = TRUE,
                            nrow = (lev <- length(levels(DF3$group))))
             colnames(out4) <- names(DF3)[-4]
             rownames(out4) <- seq_len(lev)
             out4 <- as.data.frame(out4)
             out4$group <- factor(out4$group)
             out4$idu <- factor(paste(out4$id, out4$group, sep = "_"),
                                levels = levels(DF3$idu))})
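
For comparison, the data.table package handles this idiom directly; a minimal sketch, assuming data.table is installed (DT and outDT are names introduced here for illustration):

library(data.table)
DT <- as.data.table(DF3)
outDT <- unique(DT, by = "group")  # unique() with by = keeps the first row per group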

Original

If your data are in DF, say, then:

do.call(rbind, lapply(with(DF, split(DF, group)), head, 1))

will do what you want:

> do.call(rbind, lapply(with(DF, split(DF, group)), head, 1))
  idu group
1   1     1
2   4     2
3   7     3

If the new data are in DF2 then we get:

> do.call(rbind, lapply(with(DF2, split(DF2, group)), head, 1))
  id group idu value
1  1     1 1_1    34
2  4     2 4_2     6
3  1     3 1_3    34

But for speed, we probably want to subset instead of using head(), and we can gain a bit by not using with(), e.g.:

do.call(rbind, lapply(split(DF2, DF2$group), `[`, 1, ))

> system.time(replicate(1000, do.call(rbind, lapply(split(DF2, DF2$group), `[`, 1, ))))
   user  system elapsed 
  3.847   0.040   4.044
> system.time(replicate(1000, do.call(rbind, lapply(split(DF2, DF2$group), head, 1))))
   user  system elapsed 
  4.058   0.038   4.111
> system.time(replicate(1000, aggregate(DF2[,-2], DF2["group"], function (x) x[1])))
   user  system elapsed 
  3.902   0.042   4.106


One solution using plyr, assuming your data is in an object named zzz:

ddply(zzz, "group", function(x) x[1 ,])

Another option takes the difference between consecutive rows and should prove faster, but it relies on the object being ordered beforehand. It also assumes you don't have a group value of 0:

zzz <- zzz[order(zzz$group) ,]

zzz[ diff(c(0,zzz$group)) != 0, ]
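
If a group value of 0 can't be ruled out (or group is a factor), a safer variant of the same diff() idea marks the first row explicitly:

zzz <- zzz[order(zzz$group), ]
## TRUE for the first row, then TRUE wherever the group changes
zzz[c(TRUE, diff(as.numeric(zzz$group)) != 0), ]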


I think this will do the trick:

aggregate(data["idu"], data["group"], function (x) x[1])

For your updated question, I'd recommend using ddply from the plyr package:

ddply(data, .(group), function (x) x[1,])
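
For completeness, the same one-liner in dplyr (a sketch, assuming the dplyr package is available; slice(1) takes the first row within each group):

library(dplyr)
data %>% group_by(group) %>% slice(1)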
