
Subsetting a data frame in a function using another data frame as parameter

I would like to submit a data frame to a function and use it to subset another data frame.

This is the basic data frame:

foo <- data.frame(var1 = c(1, 1, 1, 2, 2, 3), var2 = c('A', 'A', 'B', 'B', 'C', 'C'))

I use the following function to find out the frequencies of var2 for specified values of var1.

foobar <- function(x, y, z){
  a <- subset(x, (x$var1 == y))
  b <- subset(a, (a$var2 == z))
  n=nrow(b)
  return(n)
}

Examples:

foobar(foo, 1, "A") # returns 2
foobar(foo, 1, "B") # returns 1
foobar(foo, 3, "C") # returns 1

This works. But now I want to submit a data frame of values to foobar. Instead of the above examples, I would like to submit df to foobar and get the same results as above (2, 1, 1)

df <- data.frame(var1=c(1, 1, 3), var2=c("A", "B", "C"))

When I change foobar to accept two arguments, as in foobar(foo, df), and use y[, c(var1)] and y[, c(var2)] in place of the two parameters y and z, it still doesn't work. How can I do this?

edit1: last paragraph clarified

edit2: var1 type corrected


Try this:

match_df <- function(x, match) {
  vars <- names(match)

  # Build one comparable key per row from the shared columns
  x_key <- do.call(paste, c(unname(as.list(x[vars])), sep = "\r"))
  match_key <- do.call(paste, c(unname(as.list(match[vars])), sep = "\r"))

  # Keep every row of x whose key also occurs in match
  x[x_key %in% match_key, ]
}

match_df(foo, df)
#   var1 var2
# 1    1    A
# 2    1    A
# 3    1    B
# 6    3    C


Your function foobar is expecting three arguments, and you only supplied two arguments to it with foobar(foo, df). You can use apply to get what you want:

apply(df, 1, function(x) foobar(foo, x[1], x[2]))

And in use:

> apply(df, 1, function(x) foobar(foo, x[1], x[2]))
[1] 2 1 1

To respond to your edit:

I'm not entirely sure what y[, c(var1)] means, but here's my attempt at working out what you are trying to do.

What I think you were trying to do was: foobar(foo, y = df[, "var1"], z = df[, "var2"]).

First, note that c() is not needed here: you can reference a column either by placing its name in quotes or by its number (as I did above). Second, df[, "var1"] returns all of the rows for the column named var1, which has a length of three:

> length(df[, "var1"])
[1] 3

The function you defined is not set up to deal with vectors of length greater than 1. That is why we need to iterate over the rows of your data frame: grab a single pair of values, process it, and then move on to the next row. That is what the apply function does here. It is roughly equivalent to writing a loop along the lines of for (i in 1:nrow(df)), but it is a more idiomatic way of handling such problems.
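
For comparison, the explicit loop that apply stands in for might look like the sketch below. foo, df, and foobar are repeated from the question so the snippet runs on its own; stringsAsFactors = FALSE and the counts variable are assumptions added for illustration, not part of the original post.

```r
# Data and function repeated from the question
foo <- data.frame(var1 = c(1, 1, 1, 2, 2, 3),
                  var2 = c("A", "A", "B", "B", "C", "C"),
                  stringsAsFactors = FALSE)
df <- data.frame(var1 = c(1, 1, 3), var2 = c("A", "B", "C"),
                 stringsAsFactors = FALSE)

foobar <- function(x, y, z) {
  a <- subset(x, x$var1 == y)
  b <- subset(a, a$var2 == z)
  nrow(b)
}

# Pre-allocate the result, then fill it in one row of df at a time
counts <- integer(nrow(df))
for (i in seq_len(nrow(df))) {
  counts[i] <- foobar(foo, df$var1[i], df$var2[i])
}
counts
```

seq_len(nrow(df)) is used instead of 1:nrow(df) so that the loop body is skipped when df has no rows.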

Finally, is there a reason you generated var1 as a factor? In my opinion it probably makes more sense to treat these values as numeric. Compare:

> str(df)
'data.frame':   3 obs. of  2 variables:
 $ var1: Factor w/ 2 levels "1","3": 1 1 2
 $ var2: Factor w/ 3 levels "A","B","C": 1 2 3

Versus

> df2 <- data.frame(var1=c(1,1,3), var2=c("A", "B", "C"))
> str(df2)
'data.frame':   3 obs. of  2 variables:
 $ var1: num  1 1 3
 $ var2: Factor w/ 3 levels "A","B","C": 1 2 3

In summary - apply is the function you are after here. You may want to spend some time thinking about whether your data should be numeric or a factor, but apply is still what you want.
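
One point worth checking when you think about types: apply first converts the data frame to a matrix, so with mixed columns each row reaches your function as a character vector. The sketch below only illustrates that; row_types and col_types are hypothetical names, and mapply over the columns is shown purely as a contrast that keeps each column's own type.

```r
df <- data.frame(var1 = c(1, 1, 3), var2 = c("A", "B", "C"),
                 stringsAsFactors = FALSE)

# apply() works on as.matrix(df), so var1 arrives as character
row_types <- unname(apply(df, 1, function(r) class(r[1])))

# mapply() walks the columns in parallel and keeps var1 numeric
col_types <- unname(mapply(function(y, z) class(y), df$var1, df$var2))

row_types
col_types
```

With foobar this makes no practical difference for the example data, because == coerces both sides to a common type before comparing.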


foobar2 <- function(x, df) {
  .dofun <- function(y, z){
    a <- subset(x, x$var1==y)
    b <- subset(a, a$var2==z)
    n <- nrow(b)
    return (n)
  }
  ans <- mapply(.dofun, as.character(df$var1), as.character(df$var2))
  names(ans) <- NULL
  return(ans)
}
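
A quick way to check the function above, as a sketch: foo and df are repeated from the question, foobar2 is repeated from this answer so the snippet runs on its own, and the result variable is a hypothetical name.

```r
foo <- data.frame(var1 = c(1, 1, 1, 2, 2, 3),
                  var2 = c("A", "A", "B", "B", "C", "C"))
df <- data.frame(var1 = c(1, 1, 3), var2 = c("A", "B", "C"))

foobar2 <- function(x, df) {
  # Count the rows of x that match one (var1, var2) pair
  .dofun <- function(y, z) {
    a <- subset(x, x$var1 == y)
    b <- subset(a, a$var2 == z)
    nrow(b)
  }
  # Walk the two columns of df in parallel, one pair per call
  ans <- mapply(.dofun, as.character(df$var1), as.character(df$var2))
  names(ans) <- NULL
  ans
}

result <- foobar2(foo, df)
result
```

This should give the same three counts as the separate foobar calls in the question.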