How do I create a new data frame from a table I generated in R?

I have a CSV file with thousands of rows and a few columns. Here is an example of what the file looks like:

Subject     Duration    
A             1.3   
B             6.7   
C             3.2   
A             2.5   
D             2.7   
E             99    
F             8.4   
G             12.5  
H             19.7  
Z             3.2   
A             56    
B             9.4   
.              .    
.              .    
.              .    

Please note that the duration may vary for the same subject. I want to add up the durations for each subject: the total duration for subject A, the total duration for subject B, and so on. I have so many subjects that I cannot type each one manually and ask for the answer. I want to find the sum of the durations for each subject, and then create a new data frame or a new file that pairs each subject name with its corresponding total duration.

Thank you very much in advance!!!!!!


Here's a base version that might work. I borrowed the example from Karsten.

What I actually do is split the data.frame by subject. This results in a list:

split(d, d$subject)

$A
   subject duration
1        A      1.3
4        A      2.5
11       A     56.0

$B
   subject duration
2        B      6.7
12       B      9.4

$C
  subject duration
3       C      3.2

Using lapply, I flip through each list element and sum column duration. I added na.rm = TRUE so that the function still sums up even if NAs are present.

Here it is in one line:

lapply(split(d, d$subject), function(x) sum(x$duration, na.rm = TRUE))

$A
[1] 59.8

$B
[1] 16.1

$C
[1] 3.2

You can unlist or put the result in a data.frame to transform a list into something more compact.

unlist(lapply(split(d, d$subject), function(x) sum(x$duration, na.rm = TRUE)))
   A    B    C    D    E    F    G    H    Z 
59.8 16.1  3.2  2.7 99.0  8.4 12.5 19.7  3.2 
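To make the "put the result in a data.frame" step concrete, here is a minimal base R sketch. It assumes the same d used above (copied from the plyr answer); the names totals, totals_df, and totals_agg are only illustrative. It also shows aggregate(), which is not used in the answer above but should give the same totals in one call.

```r
# Sample data copied from the plyr answer in this thread
d <- data.frame(
  subject  = c("A", "B", "C", "A", "D", "E", "F", "G", "H", "Z", "A", "B"),
  duration = c(1.3, 6.7, 3.2, 2.5, 2.7, 99, 8.4, 12.5, 19.7, 3.2, 56, 9.4)
)

# Named numeric vector of totals, one element per subject
totals <- sapply(split(d$duration, d$subject), sum, na.rm = TRUE)

# Turn the named vector into a two-column data frame
# (total_duration is an illustrative column name)
totals_df <- data.frame(
  subject        = names(totals),
  total_duration = unname(totals)
)

# Alternative: aggregate() sums duration within each subject in one call
totals_agg <- aggregate(duration ~ subject, data = d, FUN = sum)
```

Either data frame can then be passed to write.csv() if a file is needed.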


This is exactly the kind of task the plyr package was designed for:

#install.packages("plyr")
library(plyr)
d <- data.frame(
  subject=c("A", "B", "C", "A", "D", "E", "F", "G", "H", "Z", "A", "B"),
  duration=c(1.3, 6.7, 3.2, 2.5, 2.7, 99, 8.4, 12.5, 19.7, 3.2, 56, 9.4)
)
f <- function(df) sum(df$duration)
total_durations <- ddply(d, .(subject), f)

Update

If I understand your question, you wish to add a third column, say total_duration, that contains the sum of all durations for each subject. For this, the merge function is very helpful. Note that I saved the result of the calculation above in a new variable, total_durations. Now, to create a data.frame with three columns and write it to a file, do

result <- merge(d,total_durations, by="subject")
write.csv(result, "file.csv", row.names=FALSE)
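Since the question starts from a CSV file, the whole round trip might look like the sketch below. It is only a sketch under assumptions: the file is comma-separated with a header row naming the columns Subject and Duration, the temporary file stands in for the real one, and base R's aggregate() is swapped in for ddply so that no extra package is needed.

```r
# Write a small stand-in file so the sketch runs on its own;
# in practice, point read.csv() at the real file instead.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Subject,Duration", "A,1.3", "B,6.7", "C,3.2", "A,2.5", "B,9.4"),
  csv_path
)

# Read the data (assumes a header row with Subject and Duration)
aa <- read.csv(csv_path)

# Total duration per subject; the second column is renamed for clarity
totals <- aggregate(Duration ~ Subject, data = aa, FUN = sum)
names(totals)[2] <- "total_duration"

# One row per original observation, with the subject's total attached
result <- merge(aa, totals, by = "Subject")

# Write the merged table to a new file (a temporary path in this sketch)
out_path <- tempfile(fileext = ".csv")
write.csv(result, out_path, row.names = FALSE)
```

If only one row per subject is wanted in the output file, write totals instead of result.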

As for the data types, in the example above, the variables d, total_durations and result are data.frame objects. On the other hand, f is a function which describes what to do with the observations for each subject. Other reasonable definitions for f would be

f <- function(df) nrow(df) # counts the observations per subject
f <- function(df) mean(df$duration) # calculates the mean duration for each subject
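If several of these summaries are wanted at once, one possible base R sketch is to return a one-row data frame per subject and bind the rows together. This is an alternative to calling ddply once per definition of f, not the method used above; summarise_subject, stats, and the column names are illustrative.

```r
# Same sample data as above
d <- data.frame(
  subject  = c("A", "B", "C", "A", "D", "E", "F", "G", "H", "Z", "A", "B"),
  duration = c(1.3, 6.7, 3.2, 2.5, 2.7, 99, 8.4, 12.5, 19.7, 3.2, 56, 9.4)
)

# Hypothetical helper: count, mean, and total for one subject's rows
summarise_subject <- function(df) {
  data.frame(
    subject        = df$subject[1],
    n              = nrow(df),
    mean_duration  = mean(df$duration),
    total_duration = sum(df$duration)
  )
}

# Apply the helper to each subject's rows and stack the results
stats <- do.call(rbind, lapply(split(d, d$subject), summarise_subject))
rownames(stats) <- NULL
```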


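You can use the plyr package:

ddply(aa, "Subject", summarise, POSITION = sum(Duration))

where aa is your data.frame. Note that Duration must be unquoted inside sum(); sum("Duration") would try to add up a character string and fail.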
