Dealing with repetitive tasks in R
I often find myself having to perform repetitive tasks in R. It gets extremely frustrating having to constantly run the same function on one or more data structures over and over again.
For example, let's say I have three separate data frames in R, and I want to delete the rows in each data frame which possess a missing value. With three data frames, it's not all that difficult to run na.omit() on each of the df's, but it can get extremely inefficient when one has one hundred similar data structures which require the same action.
df1 <- data.frame(Region=c("Asia","Africa","Europe","N.America","S.America",NA),
variable=c(2004,2004,2004,2004,2004,2004), value=c(35,20,20,50,30,NA))
df2 <- data.frame(Region=c("Asia","Africa","Europe","N.America","S.America",NA),
variable=c(2005,2005,2005,2005,2005,2005), value=c(55,350,40,90,99,NA))
df3 <- data.frame(Region=c("Asia","Africa","Europe","N.America","S.America",NA),
variable=c(2006,2006,2006,2006,2006,2006), value=c(300,200,200,500,300,NA))
tot04 <- na.omit(df1)
tot05 <- na.omit(df2)
tot06 <- na.omit(df3)
What are some general guidelines for dealing with repetitive tasks in R?
Yes, I recognise that the answer to this question is specific to the task that one faces, but I'm just asking about general things that a user should consider when they have a repetitive task.
As a general guideline, if you have several objects that you want to apply the same operations to, you should collect them into one data structure. Then you can use loops, [sl]apply, etc. to do the operations in one go. In this case, instead of having separate data frames df1, df2, etc., you could put them into a list of data frames and then run na.omit on all of them:
dflist <- list(df1, df2, <...>)
dflist <- lapply(dflist, na.omit)
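A slightly fuller sketch of the same idea, using the three example data frames from the question. The element names tot04, tot05 and tot06 are borrowed from the question purely for illustration; naming the list elements is one way to keep each result identifiable after lapply.

```r
# Example data from the question (length-1 'variable' is recycled)
regions <- c("Asia", "Africa", "Europe", "N.America", "S.America", NA)
df1 <- data.frame(Region = regions, variable = 2004, value = c(35, 20, 20, 50, 30, NA))
df2 <- data.frame(Region = regions, variable = 2005, value = c(55, 350, 40, 90, 99, NA))
df3 <- data.frame(Region = regions, variable = 2006, value = c(300, 200, 200, 500, 300, NA))

# Collect the data frames into one named list
dflist <- list(tot04 = df1, tot05 = df2, tot06 = df3)

# Apply the same function to every element; the names are preserved
cleaned <- lapply(dflist, na.omit)

# Individual results are retrieved by name
cleaned$tot04

# A quick per-element summary: number of rows left in each data frame
sapply(cleaned, nrow)
```

With one hundred data frames the last three statements stay exactly the same; only the construction of the list changes.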
In addition to @Hong Ooi's answer, I suggest looking into the plyr and reshape packages. In your case the following might be useful:
df1$name <- "var1"
df2$name <- "var2"
df3$name <- "var3"
df <- rbind(df1,df2,df3)
df <- na.omit(df)
##Get various means:
> ddply(df,~name,summarise,AvgName=mean(value))
  name AvgName
1 var1    31.0
2 var2   126.8
3 var3   300.0
> ddply(df,~Region,summarise,AvgRegion=mean(value))
     Region AvgRegion
1    Africa 190.00000
2      Asia 130.00000
3    Europe  86.66667
4 N.America 213.33333
5 S.America 143.00000
> ddply(df,~variable,summarise,AvgVar=mean(value))
  variable AvgVar
1     2004   31.0
2     2005  126.8
3     2006  300.0
##Transform the data.frame into another format
> cast(Region+variable~name,data=df)
      Region variable var1 var2 var3
1     Africa     2004   20   NA   NA
2     Africa     2005   NA  350   NA
3     Africa     2006   NA   NA  200
4       Asia     2004   35   NA   NA
5       Asia     2005   NA   55   NA
6       Asia     2006   NA   NA  300
7     Europe     2004   20   NA   NA
8     Europe     2005   NA   40   NA
9     Europe     2006   NA   NA  200
10 N.America     2004   50   NA   NA
11 N.America     2005   NA   90   NA
12 N.America     2006   NA   NA  500
13 S.America     2004   30   NA   NA
14 S.America     2005   NA   99   NA
15 S.America     2006   NA   NA  300
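The manual labelling and rbind step in this answer can also be driven from a list, which scales better when there are many data frames. The sketch below uses only base R (Map, do.call and aggregate) as a stand-in for the plyr call, not as a replacement the answer itself proposes; the names dflist, tagged and combined are illustrative.

```r
# Example data from the question (length-1 'variable' is recycled)
regions <- c("Asia", "Africa", "Europe", "N.America", "S.America", NA)
df1 <- data.frame(Region = regions, variable = 2004, value = c(35, 20, 20, 50, 30, NA))
df2 <- data.frame(Region = regions, variable = 2005, value = c(55, 350, 40, 90, 99, NA))
df3 <- data.frame(Region = regions, variable = 2006, value = c(300, 200, 200, 500, 300, NA))

dflist <- list(var1 = df1, var2 = df2, var3 = df3)

# Tag each data frame with its list name instead of assigning 'name' by hand
tagged <- Map(function(d, nm) { d$name <- nm; d }, dflist, names(dflist))

# Stack all data frames and drop rows containing missing values
combined <- na.omit(do.call(rbind, tagged))

# Mean of 'value' per source data frame, comparable to the first ddply call
aggregate(value ~ name, data = combined, FUN = mean)
```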
If the names are similar you could iterate over them using the pattern argument to ls:
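# note: pattern="df" matches every object whose name contains "df"
for (i in ls(pattern="df")){
  # create a cleaned copy named "t" + original name, e.g. tdf1
  assign(paste("t",i,sep=""),na.omit(get(i)))
}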
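A variation on the loop above, sketched with toy data: mget fetches several objects by name in one call and returns them as a named list, so lapply can then be used. The anchored pattern "^df[0-9]+$" is an assumption about how the objects are named; it is stricter than "df" so that unrelated objects such as dflist are not picked up.

```r
# Toy data frames standing in for the real ones (illustrative only)
df1 <- data.frame(id = 1:3, value = c(35, NA, 20))
df2 <- data.frame(id = 1:3, value = c(55, 350, NA))
df3 <- data.frame(id = 1:3, value = c(NA, 200, 300))

# Names of the objects to process: "df" followed only by digits
nms <- ls(pattern = "^df[0-9]+$")

# Fetch all of them at once as a named list
dflist <- mget(nms)

# Apply na.omit to each element; the result keeps the original names
cleaned <- lapply(dflist, na.omit)
names(cleaned)
```

Keeping the results in a list, rather than assigning new top-level objects, means later steps can be applied with another lapply call.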
However, a more "R" way of doing it seems to be to use a separate environment and eapply:
# setup environment
env <- new.env()
# copy dataframes across (using common pattern)
for (i in ls(pattern="df")){
  assign(i,get(i),envir=env)
}
# apply function on environment
eapply(env,na.omit)
Which yields:
$df3
     Region variable value
1      Asia     2006   300
2    Africa     2006   200
3    Europe     2006   200
4 N.America     2006   500
5 S.America     2006   300

$df2
     Region variable value
1      Asia     2005    55
2    Africa     2005   350
3    Europe     2005    40
4 N.America     2005    90
5 S.America     2005    99

$df1
     Region variable value
1      Asia     2004    35
2    Africa     2004    20
3    Europe     2004    20
4 N.America     2004    50
5 S.America     2004    30
Unfortunately, this is one big list, so getting the results out as separate objects is a little tricky. Something along the lines of:
lapply(eapply(env,na.omit),function(x) assign(paste("t",substitute(x),sep=""),x,envir=.GlobalEnv))
should work, but substitute is not picking out the list element names properly: inside lapply, substitute(x) returns the expression X[[i]] rather than the name of the element.
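One possible way around the substitute problem, sketched here with toy data: eapply already returns a named list, so the names can be edited directly and the elements then turned into separate objects with list2env. The "t" prefix follows the naming used earlier in this answer; the toy data frames are illustrative.

```r
# Toy data held in a separate environment, as in the approach above
env <- new.env()
assign("df1", data.frame(value = c(35, 20, NA)), envir = env)
assign("df2", data.frame(value = c(55, NA, 40)), envir = env)

# eapply returns a list named after the objects in the environment
res <- eapply(env, na.omit)

# Prefix every name with "t", e.g. df1 becomes tdf1
names(res) <- paste0("t", names(res))

# Create one object per list element in the global environment
invisible(list2env(res, envir = globalenv()))

nrow(tdf1)
```

Whether writing many objects into the global environment is desirable depends on the task; keeping the list and indexing it by name is often easier to manage.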