Apply multiple functions to each row of a dataframe

Every time I think I understand working with vectors, what appears to be a simple problem turns my head inside out. Lots of reading and trying different examples hasn't helped on this occasion. Please spoon-feed me here...

I want to apply two custom functions to each row of a dataframe and add the results as two new columns. Here is my sample code:

# Required packages:
library(plyr)

FindMFE <- function(x) {
    MFE <- max(x, na.rm = TRUE) 
    MFE <- ifelse(is.infinite(MFE ) | (MFE  < 0), 0, MFE)
    return(MFE)
}

FindMAE <- function(x) {
    MAE <- min(x, na.rm = TRUE) 
    MAE <- ifelse(is.infinite(MAE) | (MAE> 0), 0, MAE)
    return(MAE)
}

FindMAEandMFE <- function(x){
        # I know this next line is wrong...
    z <- apply(x, 1, FindMFE, FindMFE)
        return(z)
}

df1 <- data.frame(Bar1=c(1,2,3,-3,-2,-1),Bar2=c(3,1,3,-2,-3,-1))

df1 = transform(df1, 
    FindMAEandMFE(df1)  
)

#DF1 should end up with the following data...
#Bar1   Bar2    MFE MAE
#1      3       3   0
#2      1       2   0
#3      3       3   0
#-3     -2      0   -3
#-2     -3      0   -3
#-1     -1      0   -1

It would be great to get an answer using the plyr library and a more base-like approach. Both will aid in my understanding. Of course, please point out where I'm going wrong if it's obvious. ;-)

Now back to the help files for me!

Edit: I would like a multivariate solution as column names may change and expand over time. It also allows re-use of the code in future.


I think you are overcomplicating this. What is wrong with two separate apply() calls? There is, however, a far better way to do what you are doing here that involves no looping/apply calls. I'll deal with these separately, but the second solution is preferable as it is truly vectorised.

Two apply calls version

First two separate apply calls using all-Base R functions:

df1 <- data.frame(Bar1=c(1,2,3,-3,-2,-1),Bar2=c(3,1,3,-2,-3,-1))
df1 <- transform(df1, MFE = apply(df1, 1, FindMFE), MAE = apply(df1, 1, FindMAE))
df1

Which gives:

> df1
  Bar1 Bar2 MFE MAE
1    1    3   3   0
2    2    1   2   0
3    3    3   3   0
4   -3   -2   0  -3
5   -2   -3   0  -3
6   -1   -1   0  -1

OK, looping over the rows of df1 twice is perhaps a little inefficient, but even for big problems you've already spent more time thinking about doing this cleverly in a single pass than you will save by doing it that way.

Using vectorised functions pmax() and pmin()

So a better way of doing this is to note the pmax() and pmin() functions and realise that they can do what each of the apply(df1, 1, FindFOO) calls was doing. For example:

> (tmp <- with(df1, pmax(0, Bar1, Bar2, na.rm = TRUE)))
[1] 3 2 3 0 0 0

would be MFE from your Question. This is very simple to work with if you have two columns and they are Bar1 and Bar2 or the first 2 columns of df1, always. But it is not very general; what if you have multiple columns you want to compute this over etc? pmax(df1[, 1:2], na.rm = TRUE) won't do what we want:

> pmax(df1[, 1:2], na.rm = TRUE)
  Bar1 Bar2
1    1    3
2    2    1
3    3    3
4   -3   -2
5   -2   -3
6   -1   -1

The trick to getting a general solution using pmax() and pmin() is to use do.call() to arrange the calls to those two functions for us. Updating your functions to use this idea we have:

FindMFE2 <- function(x) {
   MFE <- do.call(pmax, c(as.list(x), 0, na.rm = TRUE))
   MFE[is.infinite(MFE)] <- 0
   MFE
}

FindMAE2 <- function(x) {
   MAE <- do.call(pmin, c(as.list(x), 0, na.rm = TRUE))
   MAE[is.infinite(MAE)] <- 0
   MAE
}

which give:

> transform(df1, MFE = FindMFE2(df1), MAE = FindMAE2(df1))
  Bar1 Bar2 MFE MAE
1    1    3   3   0
2    2    1   2   0
3    3    3   3   0
4   -3   -2   0  -3
5   -2   -3   0  -3
6   -1   -1   0  -1

and not an apply() in sight. If you want to do this in a single step, this is now much easier to wrap:

FindMAEandMFE2 <- function(x){
    cbind(MFE = FindMFE2(x), MAE = FindMAE2(x))
}

which can be used as:

> cbind(df1, FindMAEandMFE2(df1))
  Bar1 Bar2 MFE MAE
1    1    3   3   0
2    2    1   2   0
3    3    3   3   0
4   -3   -2   0  -3
5   -2   -3   0  -3
6   -1   -1   0  -1
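
As a rough check that this generalises beyond two columns, here is a minimal sketch that runs the do.call() versions on a data frame with a third column and some missing values, and compares them with the row-wise apply() versions. The data frame df2 and its Bar3 column are made up for illustration, and the functions are repeated so the block runs on its own.

```r
# Row-wise versions from the question (scalar result per row)
FindMFE <- function(x) {
    MFE <- max(x, na.rm = TRUE)
    ifelse(is.infinite(MFE) | (MFE < 0), 0, MFE)
}
FindMAE <- function(x) {
    MAE <- min(x, na.rm = TRUE)
    ifelse(is.infinite(MAE) | (MAE > 0), 0, MAE)
}

# Vectorised versions: one pmax()/pmin() call over all columns at once
FindMFE2 <- function(x) {
    MFE <- do.call(pmax, c(as.list(x), 0, na.rm = TRUE))
    MFE[is.infinite(MFE)] <- 0
    MFE
}
FindMAE2 <- function(x) {
    MAE <- do.call(pmin, c(as.list(x), 0, na.rm = TRUE))
    MAE[is.infinite(MAE)] <- 0
    MAE
}

# Hypothetical data: three columns, with an NA in two of the rows
df2 <- data.frame(Bar1 = c(1, -3, NA),
                  Bar2 = c(3, -2, -1),
                  Bar3 = c(NA, 4, -5))

vectorised <- cbind(MFE = FindMFE2(df2), MAE = FindMAE2(df2))
rowwise    <- cbind(MFE = apply(df2, 1, FindMFE),
                    MAE = apply(df2, 1, FindMAE))

# Both approaches should agree on the values
all.equal(unname(vectorised), unname(rowwise))
```

No column names appear inside the functions, so the same code should keep working as columns are added or renamed, provided every column is numeric.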


I show three alternative one-liners:

  • Using the each function of plyr
  • Using the plyr each function with base R
  • Using the pmin and pmax functions, which are vectorised

Solution 1: plyr and each

The plyr package defines the each function that does what you want. From ?each: Aggregate multiple functions into a single function. This means you can solve your problem using a one-liner:

library(plyr)
adply(df1, 1, each(MFE=function(x)max(x, 0), MAE=function(x)min(x, 0)))

  Bar1 Bar2 MFE MAE
1    1    3   3   0
2    2    1   2   0
3    3    3   3   0
4   -3   -2   0  -3
5   -2   -3   0  -3
6   -1   -1   0  -1

Solution 2: each and base R

You can, of course, use each with base functions. Here is how you can use it with apply - just note that you have to transpose the results before adding to your original data.frame.

library(plyr)
data.frame(df1, 
  t(apply(df1, 1, each(MFE=function(x)max(x, 0), MAE=function(x)min(x, 0)))))

  Bar1 Bar2 MFE MAE
1    1    3   3   0
2    2    1   2   0
3    3    3   3   0
4   -3   -2   0  -3
5   -2   -3   0  -3
6   -1   -1   0  -1
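
To see why the transpose is needed, here is a small base-R sketch that does not use plyr. When the function passed to apply() returns a vector of length two, apply() stores each row's result as a column, so the raw result has two rows and six columns. The helper both() is a hypothetical stand-in for each(...), not plyr's implementation.

```r
df1 <- data.frame(Bar1 = c(1, 2, 3, -3, -2, -1),
                  Bar2 = c(3, 1, 3, -2, -3, -1))

# Stand-in for each(...): return both values as one named vector
both <- function(r) c(MFE = max(r, 0), MAE = min(r, 0))

raw <- apply(df1, 1, both)
dim(raw)      # 2 by 6: one column per row of df1
dim(t(raw))   # 6 by 2: one row per row of df1, ready to bind

result <- data.frame(df1, t(raw))
result
```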

Solution 3: using vectorised functions

Using vectorised functions pmin and pmax, you can use this one-liner:

transform(df1, MFE=pmax(0, Bar1, Bar2), MAE=pmin(0, Bar1, Bar2))

  Bar1 Bar2 MFE MAE
1    1    3   3   0
2    2    1   2   0
3    3    3   3   0
4   -3   -2   0  -3
5   -2   -3   0  -3
6   -1   -1   0  -1


There are lots of good answers here. I started this while Gavin Simpson was editing, so we cover some similar ground. What the parallel min and max functions (pmin and pmax) do is pretty much exactly what you're writing your functions for. It may be a little opaque what the 0 does in pmax(0, Bar1, Bar2), but essentially the 0 gets recycled so that it's like doing

pmax(c(0,0,0,0,0,0), Bar1, Bar2)

That will take each item of the three things passed and find the max of them. So, the max will be 0 if it was negative and accomplishes much of what your ifelse statement did. You could rewrite so you get vectors and combine things with functions similar to what you were doing and that might make it a bit more transparent. In this case we'd just pass the dataframe to a new parallel and fast findMFE function that will work with any numeric dataframe and get out a vector.

findMFE <- function(dataf){
    MFE <- do.call( pmax, c(dataf, 0, na.rm = TRUE))
}

MFE <- findMFE(df1)

What this function does is append a 0 to the columns of the passed data frame (pmax recycles it, so it acts like an extra column of 0s) and then call pmax, passing each separate column of df1 as its own argument (dataframes are lists, so this is easy).
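
To make the do.call() step concrete, the following sketch builds the argument list by hand and checks that the call matches writing the columns out explicitly. The variable names args_list, via_do_call and explicit are only for illustration.

```r
df1 <- data.frame(Bar1 = c(1, 2, 3, -3, -2, -1),
                  Bar2 = c(3, 1, 3, -2, -3, -1))

# c() on a data frame yields a plain list: the columns, then the extra items
args_list <- c(df1, 0, na.rm = TRUE)
is.list(args_list)
names(args_list)

# do.call() hands each list element to pmax() as a separate argument
via_do_call <- do.call(pmax, args_list)

# The same call written out by hand for the two-column case
explicit <- with(df1, pmax(Bar1, Bar2, 0, na.rm = TRUE))

all.equal(via_do_call, explicit)
```

The hand-written form has to change whenever the columns change, whereas the do.call() form does not.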

Now, I note that you actually want to correct for Inf values in your data that aren't in your example... we could add an extra line to your function...

findMFE <- function(dataf){
    MFE <- do.call( pmax, c(dataf, 0, na.rm = TRUE))
    ifelse(is.infinite(MFE), 0, MFE)
}

Now, that's proper use of the ifelse() function on a vector. I did it that way as an example for you but Gavin Simpson's use of MFE[is.infinite(MFE)] <- 0 is more efficient. Note that this findMFE function isn't used in a loop, it's just passed the whole data frame.

The comparable findMAE is...

findMAE <- function(dataf){
    MAE <- do.call( pmin, c(dataf, 0, na.rm = TRUE))
    ifelse(is.infinite(MAE), 0, MAE)
}

and the combined function is simply...

findMFEandMAE <- function(dataf){
    MFE <- findMFE(dataf)
    MAE <- findMAE(dataf)
    return(data.frame(MFE, MAE))
}

MFEandMAE <- findMFEandMAE(df1)
df1 <- cbind(df1, MFEandMAE)

Some tips

If you've got a scalar if statement don't use ifelse(), use if() else. It's much faster in scalar situations. And, your functions are scalar and you're trying to vectorize them. ifelse() is already vectorized and runs very fast when used that way but much slower than if() else when used scalar.

Also, if you're going to be putting stuff in a loop or apply statement put as little in there as possible. For example, in your case the ifelse() really needed to be taken out of the loop and applied to the whole MFE result afterwards.
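
A rough sketch of both tips, using hypothetical names (FindMFE_scalar, raw_max, mfe_a, mfe_b). In the first variant the per-row function uses a plain if () else on its single value. In the second, the clamping is lifted out of the loop and applied once to the whole result vector.

```r
df1 <- data.frame(Bar1 = c(1, 2, 3, -3, -2, -1),
                  Bar2 = c(3, 1, 3, -2, -3, -1))

# Tip 1: the per-row value is a scalar, so use if () else rather than ifelse()
FindMFE_scalar <- function(x) {
    MFE <- max(x, na.rm = TRUE)
    if (is.infinite(MFE) || MFE < 0) 0 else MFE
}
mfe_a <- apply(df1, 1, FindMFE_scalar)

# Tip 2: keep the loop body minimal, then clean up the whole vector at once
raw_max <- apply(df1, 1, max, na.rm = TRUE)
mfe_b <- ifelse(is.infinite(raw_max) | raw_max < 0, 0, raw_max)

all.equal(unname(mfe_a), unname(mfe_b))
```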


If you really, really want it, you can:

FindMAEandMFE <- function(x){
    t(apply(x, 1, function(currow){c(MAE=FindMAE(currow), MFE=FindMFE(currow))}))
}

(not tested - it should return an array with two (named, I think) columns and as many rows as the data.frame had). Now you can do:

df1<-cbind(df1, FindMAEandMFE(df1))

Very icky. Please heed Gavin's advice.
