How do I create a new data frame based on a table I generated by R?
I have a CSV file with thousands of rows and a few columns. Here is an example of what the file looks like:
Subject Duration
A 1.3
B 6.7
C 3.2
A 2.5
D 2.7
E 99
F 8.4
G 12.5
H 19.7
Z 3.2
A 56
B 9.4
. .
. .
. .
Please notice that for the same subject, the duration might vary. I want to add up the duration for each particular subject, for example, I want to know the total duration for subject A, total duration for subject B, etc. I have so many subject titles that I cannot manually type every single subject and ask for the answer. I want to find out the sum of duration for each subject, and then create a new data frame or a new file which will have the subject name corresponding to the total duration.
Thank you very much in advance!!!!!!
Here's a base version that might work. I borrowed the example from Karsten.
What I actually do is split the data.frame according to subject. This results in a list:
split(d, d$subject)
$A
subject duration
1 A 1.3
4 A 2.5
11 A 56.0
$B
subject duration
2 B 6.7
12 B 9.4
$C
subject duration
3 C 3.2
Using lapply, I go through each list element and sum the column duration. I added na.rm = TRUE so that the function still sums up even if NAs are present. I present this in one line:
lapply(split(d, d$subject), function(x) sum(x$duration, na.rm = TRUE))
$A
[1] 59.8
$B
[1] 16.1
$C
[1] 3.2
You can unlist or put the result in a data.frame to transform the list into something more compact.
unlist(lapply(split(d, d$subject), function(x) sum(x$duration, na.rm = TRUE)))
A B C D E F G H Z
59.8 16.1 3.2 2.7 99.0 8.4 12.5 19.7 3.2
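To get from the named vector above to the data frame the question asks for, one option is to wrap the names and the values yourself. This is only a minimal sketch: it assumes the example data d with lowercase column names subject and duration, and the names totals and totals_df are made up for illustration.

```r
# Example data (same values as in the question)
d <- data.frame(
  subject  = c("A", "B", "C", "A", "D", "E", "F", "G", "H", "Z", "A", "B"),
  duration = c(1.3, 6.7, 3.2, 2.5, 2.7, 99, 8.4, 12.5, 19.7, 3.2, 56, 9.4)
)

# Named numeric vector: one sum per subject, ignoring NAs
totals <- sapply(split(d$duration, d$subject), sum, na.rm = TRUE)

# Build a two-column data frame from the names and the values
totals_df <- data.frame(
  subject        = names(totals),
  total_duration = unname(totals)
)

print(totals_df)
```

aggregate(duration ~ subject, data = d, FUN = sum) should give a similar table in one call; note that its formula interface drops rows with NA by default rather than summing around them.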
This is a task the package plyr was invented for:
#install.packages("plyr")
library(plyr)
d <- data.frame(
subject=c("A", "B", "C", "A", "D", "E", "F", "G", "H", "Z", "A", "B"),
duration=c(1.3, 6.7, 3.2, 2.5, 2.7, 99, 8.4, 12.5, 19.7, 3.2, 56, 9.4)
)
f <- function(df) sum(df$duration)
total_durations <- ddply(d, .(subject), f)
Update
If I understand your question, you wish to add a third column, say total_duration, that contains the sum of all durations for each subject. For this, the merge function is very helpful. Note that I saved the result of the calculation above as a new variable, total_durations. Now, to create a data.frame with three columns and write it to a file, do
result <- merge(d,total_durations, by="subject")
write.csv(result, "file.csv", row.names=FALSE)
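If plyr is not available, base R's ave can attach the same per-subject total as a third column without a separate merge step. This is a swapped-in alternative rather than the approach above, and only a sketch: the column name total_duration and the temporary output path out_file are illustrative assumptions.

```r
# Example data (same values as in the question)
d <- data.frame(
  subject  = c("A", "B", "C", "A", "D", "E", "F", "G", "H", "Z", "A", "B"),
  duration = c(1.3, 6.7, 3.2, 2.5, 2.7, 99, 8.4, 12.5, 19.7, 3.2, 56, 9.4)
)

# ave() applies FUN within each subject group and returns a vector
# aligned with the rows of d, so every row gets its subject's total
d$total_duration <- ave(d$duration, d$subject, FUN = sum)

# Write the three-column result to a (temporary) CSV file
out_file <- tempfile(fileext = ".csv")
write.csv(d, out_file, row.names = FALSE)

print(head(d))
```

Unlike merge, which sorts the result by the key column by default, this keeps the rows in their original order.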
As for the data types, in the example above, the variables d, total_durations and result are data.frame objects. On the other hand, f is a function which describes what to do with the observations for each subject. Other reasonable definitions for f would be
f <- function(df) nrow(df) # counts the observations per subject
f <- function(df) mean(df$duration) # calculates the mean duration for each subject
You can use the plyr package
ddply(aa, "Subject", summarise, POSITION = sum(Duration)) # Duration is unquoted so it refers to the column, not a string
where the aa variable is your data.frame