grouping dataframe rows by factor and by function - output complete original dataframe row
This is my first post and I'm very new to R, so this may be a lob. I have searched all over for a solution, though, so I'm finally posting for help. Let me know if I need to clarify or provide more information.
I have a large dataframe that looks like the following:
numReads length name2
0 7384 Ssxb2
7904 93237 St5
3438 12969 Taf9b
0 996 Tas2r138
0 882 Tas2r143
0 960 Tas2r144
0 6761 Tbx10
8125 43804 Tdrd1
8124 43738 Tdrd1
8102 39301 Tdrd1
1227 9286 Thnsl1
How can I group the data by the third column (name2), find the max() value of numReads within each group, and keep the associated length value?
My ideal output would be the above data without the two lines associated with "Tdrd1" that DO NOT contain the max value for that factor level (the lines with the 8124 and 8102 values).
I have tried tapply(), by(), and aggregate(). None of them gives me the proper output.
Thanks in advance.
Edit after comments that came FAR faster than expected. Thank you!
Ideal example results would look like the following
numReads length name2
0 7384 Ssxb2
7904 93237 St5
3438 12969 Taf9b
0 996 Tas2r138
0 882 Tas2r143
0 960 Tas2r144
0 6761 Tbx10
8125 43804 Tdrd1
1227 9286 Thnsl1
So it does seem like I have two questions here. The first is to group the data based on a factor. The second is how to calculate a function on the group, but output the entire row after calculating the chosen function.
I like the idea of an aggregate() followed by a merge(). But how will the merge() function know WHICH row of the original rows from which to grab the 'length' value based on a common factor level?
The data is a snapshot of gene expression data based on transcript annotations. I am trying to select the most highly expressed transcript (in terms of numReads) for each associated 'name2'. I need the length data for downstream normalization.
EDIT after trying to use the very helpful suggestion by ROLO. Thanks again!
Also, thank you Chase and daroczig for your help as well.
So I am trying to use the ddply() approach to split my dataframe by 'name2', sort by the number of reads in decreasing order, and select the top row. This effectively gives me the row with the max numReads for each 'name2' group and keeps all my original information, especially the length.
Unfortunately, I'm trying to do this on a dataframe with >34,000 rows. It works fine for ~1000 rows, and even ~5000 rows, but crashes when I give it my whole dataset.
I've tried to use the .parallel option, but it fails with the following error:
Loading required package: foreach
Error: foreach package required for parallel plyr operation
I've also tried to monitor the operation with the .progress option. The progress bar makes it to 100%, but the operation never finishes.
Any ideas on how to apply this operation to my complete dataset?
Use plyr to split on name2, then sort numReads in decreasing order and select the first row:
library(plyr)
# For each name2 group, sort by numReads (largest first) and keep the top row
ddply(df, "name2", function(dat) {
  dat[order(dat$numReads, decreasing = TRUE), ][1, ]
})
numReads length name2
1 0 7384 Ssxb2
2 7904 93237 St5
3 3438 12969 Taf9b
4 0 996 Tas2r138
5 0 882 Tas2r143
6 0 960 Tas2r144
7 0 6761 Tbx10
8 8125 43804 Tdrd1
9 1227 9286 Thnsl1
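If the per-group function call becomes too slow on a large data frame, a base R variant of the same sort-and-take-first idea may be worth trying. This is only a sketch, not a benchmarked replacement: it rebuilds the demo data from the question, sorts the whole data frame once, and then keeps the first row seen for each name2. The names `sorted` and `top` are made up for illustration.

```r
# Demo data copied from the question
df <- data.frame(
  numReads = c(0, 7904, 3438, 0, 0, 0, 0, 8125, 8124, 8102, 1227),
  length   = c(7384, 93237, 12969, 996, 882, 960, 6761, 43804, 43738, 39301, 9286),
  name2    = c("Ssxb2", "St5", "Taf9b", "Tas2r138", "Tas2r143", "Tas2r144",
               "Tbx10", "Tdrd1", "Tdrd1", "Tdrd1", "Thnsl1")
)

# Sort by name2, and within each name2 by numReads in decreasing order
sorted <- df[order(df$name2, -df$numReads), ]

# duplicated() flags every repeat of a name2 after its first occurrence,
# so negating it keeps exactly one row per name2: the one with max numReads
top <- sorted[!duplicated(sorted$name2), ]
rownames(top) <- NULL
top
```

Because the whole row is carried along, the Tdrd1 row keeps its own length (43804). If two rows of a group tie for the maximum, only the first one after sorting is kept.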
I might not get exactly what you are after, but I think you want the rows of the data frame that have the highest value of numReads per level of name2. This can be done easily, e.g. with aggregate and a later merge.
Your demo dataset:
df <- structure(list(numReads = c(0L, 7904L, 3438L, 0L, 0L, 0L, 0L,
8125L, 8124L, 8102L, 1227L), length = c(7384L, 93237L, 12969L,
996L, 882L, 960L, 6761L, 43804L, 43738L, 39301L, 9286L), name2 = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 8L, 9L), .Label = c("Ssxb2",
"St5", "Taf9b", "Tas2r138", "Tas2r143", "Tas2r144", "Tbx10",
"Tdrd1", "Thnsl1"), class = "factor")), .Names = c("numReads",
"length", "name2"), class = "data.frame", row.names = c(NA, -11L
))
Let us aggregate the data frame by name2 with the max function:
> df.a <- aggregate(numReads ~ name2, df, max)
> df.a
name2 numReads
1 Ssxb2 0
2 St5 7904
3 Taf9b 3438
4 Tas2r138 0
5 Tas2r143 0
6 Tas2r144 0
7 Tbx10 0
8 Tdrd1 8125
9 Thnsl1 1227
And merge the original values of length into the aggregated data frame (df.a):
> merge(df.a, df)
name2 numReads length
1 Ssxb2 0 7384
2 St5 7904 93237
3 Taf9b 3438 12969
4 Tas2r138 0 996
5 Tas2r143 0 882
6 Tas2r144 0 960
7 Tbx10 0 6761
8 Tdrd1 8125 43804
9 Thnsl1 1227 9286
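On how merge() knows which original row to take the length from: by default it joins on every column the two data frames share, here name2 and numReads, so each maximum is matched back to the row it came from. The sketch below spells that out with an explicit `by` argument and shows what happens when a group has a tie. The extra tied Tdrd1 row (length 40000) is a hypothetical value added purely for illustration, and `res`, `df_tie`, `res_tie`, and `res_one` are made-up names.

```r
# Demo data copied from the question
df <- data.frame(
  numReads = c(0, 7904, 3438, 0, 0, 0, 0, 8125, 8124, 8102, 1227),
  length   = c(7384, 93237, 12969, 996, 882, 960, 6761, 43804, 43738, 39301, 9286),
  name2    = c("Ssxb2", "St5", "Taf9b", "Tas2r138", "Tas2r143", "Tas2r144",
               "Tbx10", "Tdrd1", "Tdrd1", "Tdrd1", "Thnsl1")
)

# Per-group maximum, then join back on BOTH the group and the max value
df.a <- aggregate(numReads ~ name2, data = df, FUN = max)
res  <- merge(df.a, df, by = c("name2", "numReads"))

# Hypothetical tie: a second Tdrd1 transcript that also has 8125 reads
df_tie  <- rbind(df, data.frame(numReads = 8125, length = 40000, name2 = "Tdrd1"))
res_tie <- merge(aggregate(numReads ~ name2, data = df_tie, FUN = max),
                 df_tie, by = c("name2", "numReads"))

# With a tie, merge() returns both tied rows; keep one per name2 if needed
res_one <- res_tie[!duplicated(res_tie$name2), ]
```

With the original data, `res` has one row per name2. With the tie, `res_tie` has two Tdrd1 rows, so a rule for choosing between tied transcripts is needed before the length is used downstream.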
I hope I did not misunderstand your question!
There are seemingly two different questions here. The first can be solved with the plyr package:
library(plyr)
txt <- "numReads length name2
0 7384 Ssxb2
7904 93237 St5
3438 12969 Taf9b
0 996 Tas2r138
0 882 Tas2r143
0 960 Tas2r144
0 6761 Tbx10
8125 43804 Tdrd1
8124 43738 Tdrd1
8102 39301 Tdrd1
1227 9286 Thnsl1
"
dat <- read.table(textConnection(txt), header = TRUE)
ddply(dat, "name2", summarize, max = max(numReads))
Gives you:
name2 max
1 Ssxb2 0
2 St5 7904
3 Taf9b 3438
4 Tas2r138 0
5 Tas2r143 0
6 Tas2r144 0
7 Tbx10 0
8 Tdrd1 8125
9 Thnsl1 1227
The second question can seemingly be answered with:
dat[dat$name2 == "Tdrd1" & dat$numReads != max(dat$numReads[dat$name2 == "Tdrd1"]),]
numReads length name2
9 8124 43738 Tdrd1
10 8102 39301 Tdrd1
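The same comparison can be applied to every group at once, without hard-coding "Tdrd1". This is a minimal sketch using base R's ave(), which returns the group maximum repeated for each row. It rebuilds the data with data.frame() so it runs on its own, and `is_max` is a made-up name.

```r
# Demo data copied from the question
dat <- data.frame(
  numReads = c(0, 7904, 3438, 0, 0, 0, 0, 8125, 8124, 8102, 1227),
  length   = c(7384, 93237, 12969, 996, 882, 960, 6761, 43804, 43738, 39301, 9286),
  name2    = c("Ssxb2", "St5", "Taf9b", "Tas2r138", "Tas2r143", "Tas2r144",
               "Tbx10", "Tdrd1", "Tdrd1", "Tdrd1", "Thnsl1")
)

# TRUE where a row's numReads equals the maximum within its name2 group
is_max <- dat$numReads == ave(dat$numReads, dat$name2, FUN = max)

dat[!is_max, ]  # rows that are NOT the maximum of their group
dat[is_max, ]   # one row per group here; tied maxima would all be kept
```

On this data, `dat[!is_max, ]` returns the same two Tdrd1 rows as above, and `dat[is_max, ]` is the full-row result asked for in the question.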
Provide some more context on what you're trying to do and I'll elaborate further.