Transforming Data Frame in R

I have a data frame with multiple variables, each of which has multiple categories. I'd like to take each category and convert it to an indicator variable.

V1 V2 V3 V4
xc ab ty ky
xc ab ty kj
xc yi tf kj
cv yi tf kj
bg yt tg kl
bg yu yu kl

convert to

xc cv bg .....
T  F  F......
T  F  F....
T  F  F....
F  T  F....
F  F  T...
F  F  T....

I tried

newframe <- transform(oldframe, xc = to_column(oldframe$V1,'xc')) 

where to_column is

to_column = function(col, val){
    if (col == val)
        'TRUE'  else
        'FALSE' }


This is one standard approach to creating dummy variables from a categorical variable:

model.matrix( ~ V1 - 1, data=df)

Here df is your data.frame as shown in your question. The result is a numeric matrix of 0/1 values, with 0 standing in for FALSE and 1 for TRUE. Hope that helps!

Best regards,

Jay
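For anyone who wants to run this, here is a minimal sketch that rebuilds the question's data by hand. The names df, ind, ind_logical and is_xc are only assumptions for illustration, and the columns are created as factors on purpose.

```r
# Rebuild the example data from the question. stringsAsFactors = TRUE makes
# every column a factor (an assumption; newer R versions default to character).
df <- data.frame(
  V1 = c("xc", "xc", "xc", "cv", "bg", "bg"),
  V2 = c("ab", "ab", "yi", "yi", "yt", "yu"),
  V3 = c("ty", "ty", "tf", "tf", "tg", "yu"),
  V4 = c("ky", "kj", "kj", "kj", "kl", "kl"),
  stringsAsFactors = TRUE
)

# One 0/1 column per level of V1; "- 1" drops the intercept so that no
# level is absorbed into a baseline.
ind <- model.matrix(~ V1 - 1, data = df)

# Comparing with 1 turns the 0/1 matrix into TRUE/FALSE and keeps the
# column names (V1bg, V1cv, V1xc).
ind_logical <- ind == 1

# For a single category, a plain vectorised comparison is enough.
is_xc <- df$V1 == "xc"
```

The last line is also why the to_column attempt in the question does not do what was intended: if() expects a single TRUE or FALSE, whereas == already compares a whole column at once.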


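Building on @Jay's answer, here is how to get a logical matrix instead of 0/1 values. The code assumes the columns of dat are factors, so that levels() returns the category names.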

Logical matrix version:

out <- model.matrix( ~ V1 - 1, data=dat)
out <- matrix(as.logical(out), ncol = ncol(out))
colnames(out) <- with(dat, levels(V1))

> out
        bg    cv    xc
[1,] FALSE FALSE  TRUE
[2,] FALSE FALSE  TRUE
[3,] FALSE FALSE  TRUE
[4,] FALSE  TRUE FALSE
[5,]  TRUE FALSE FALSE
[6,]  TRUE FALSE FALSE

All variables at once version:

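# lapply always returns a list; sapply would simplify to a matrix
# if every variable happened to have the same number of levels
out2 <- lapply(dat, function(x) model.matrix( ~ x - 1))
out2 <- do.call(cbind, out2)
out2 <- matrix(as.logical(out2), ncol = ncol(out2))
colnames(out2) <- unlist(lapply(dat, levels))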

> out2
        bg    cv    xc    ab    yi    yt    yu    tf    tg    ty
[1,] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
[2,] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
[3,] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[4,] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[5,]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
[6,]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
        yu    kj    kl    ky
[1,] FALSE FALSE FALSE  TRUE
[2,] FALSE  TRUE FALSE FALSE
[3,] FALSE  TRUE FALSE FALSE
[4,] FALSE  TRUE FALSE FALSE
[5,] FALSE FALSE  TRUE FALSE
[6,]  TRUE FALSE  TRUE FALSE

If you don't want this as a full matrix like above, then you can stop with the first line, which has all the model matrices in a list, one for each variable (column) in dat, and convert them to logical. This one-liner does both steps:

> lapply(lapply(dat, function(x) model.matrix( ~ x - 1)),
+        function(x) matrix(as.logical(x), ncol = ncol(x)))
$V1
      [,1]  [,2]  [,3]
[1,] FALSE FALSE  TRUE
[2,] FALSE FALSE  TRUE
[3,] FALSE FALSE  TRUE
[4,] FALSE  TRUE FALSE
[5,]  TRUE FALSE FALSE
[6,]  TRUE FALSE FALSE

$V2
      [,1]  [,2]  [,3]  [,4]
[1,]  TRUE FALSE FALSE FALSE
[2,]  TRUE FALSE FALSE FALSE
[3,] FALSE  TRUE FALSE FALSE
[4,] FALSE  TRUE FALSE FALSE
[5,] FALSE FALSE  TRUE FALSE
[6,] FALSE FALSE FALSE  TRUE

$V3
      [,1]  [,2]  [,3]  [,4]
[1,] FALSE FALSE  TRUE FALSE
[2,] FALSE FALSE  TRUE FALSE
[3,]  TRUE FALSE FALSE FALSE
[4,]  TRUE FALSE FALSE FALSE
[5,] FALSE  TRUE FALSE FALSE
[6,] FALSE FALSE FALSE  TRUE

$V4
      [,1]  [,2]  [,3]
[1,] FALSE FALSE  TRUE
[2,]  TRUE FALSE FALSE
[3,]  TRUE FALSE FALSE
[4,]  TRUE FALSE FALSE
[5,] FALSE  TRUE FALSE
[6,] FALSE  TRUE FALSE

And if the variable names are important, then we can modify this to

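foo <- function(x) {
    # x is the factor itself, so levels(x) is available for the column names
    # (levels() of a model matrix would be NULL)
    mm <- model.matrix( ~ x - 1)
    mat <- matrix(as.logical(mm), ncol = ncol(mm))
    colnames(mat) <- levels(x)
    mat
}
lapply(dat, foo)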
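One thing to watch when everything is bound into a single matrix: "yu" occurs in both V2 and V3, so the combined matrix ends up with two columns named yu. A possible sketch that prefixes each level with its variable name is below. The "_" separator and the names dat, mats and out_named are assumptions for illustration, and the columns are again assumed to be factors.

```r
# Example data from the question, with factor columns (an assumption).
dat <- data.frame(
  V1 = c("xc", "xc", "xc", "cv", "bg", "bg"),
  V2 = c("ab", "ab", "yi", "yi", "yt", "yu"),
  V3 = c("ty", "ty", "tf", "tf", "tg", "yu"),
  V4 = c("ky", "kj", "kj", "kj", "kl", "kl"),
  stringsAsFactors = TRUE
)

# Build one logical indicator matrix per variable and label each column
# as <variable>_<level>, so a level shared by two variables stays distinct.
mats <- lapply(names(dat), function(nm) {
  x <- dat[[nm]]
  m <- model.matrix(~ x - 1) == 1
  colnames(m) <- paste(nm, levels(x), sep = "_")
  m
})

# Bind the per-variable matrices side by side into one logical matrix.
out_named <- do.call(cbind, mats)
```

With this, out_named[, "V2_yu"] and out_named[, "V3_yu"] can be addressed separately.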


You could have a look at the reshape package, which provides functionality to pivot data like this. There are examples of its use on the author's homepage.
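The reshape functions themselves are not shown here. As a rough sketch of the same long-then-wide idea using base R only (table() stands in for reshape's casting step, and dat, long, counts and present are names assumed for illustration):

```r
# Example data from the question.
dat <- data.frame(
  V1 = c("xc", "xc", "xc", "cv", "bg", "bg"),
  V2 = c("ab", "ab", "yi", "yi", "yt", "yu"),
  V3 = c("ty", "ty", "tf", "tf", "tg", "yu"),
  V4 = c("ky", "kj", "kj", "kj", "kl", "kl"),
  stringsAsFactors = TRUE
)

# "Melt" to long form: one record per cell, remembering its row number.
# unlist() goes column by column, which matches the repeated row index.
long <- data.frame(
  row   = rep(seq_len(nrow(dat)), times = ncol(dat)),
  value = unlist(lapply(dat, as.character), use.names = FALSE)
)

# "Cast" back to wide form: count how often each value occurs in each row.
counts <- table(long$row, long$value)

# Presence/absence instead of counts, as a plain logical matrix.
present <- unclass(counts) > 0
```

Unlike the per-variable model matrices, this pools values across variables, so a value that appears in two columns of the same row is counted twice in counts and shown once in present.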


This is quite straightforward with mtabulate from the "qdap" package:

library(qdap)
mtabulate(split(mydf, 1:nrow(mydf))) > 0
#      ab    bg    cv    kj    kl    ky    tf    tg    ty    xc    yi
# 1  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
# 2  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
# 3 FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
# 4 FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
# 5 FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
# 6 FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#      yt    yu
# 1 FALSE FALSE
# 2 FALSE FALSE
# 3 FALSE FALSE
# 4 FALSE FALSE
# 5  TRUE FALSE
# 6 FALSE  TRUE

By default, mtabulate tabulates the results (surprise!), so the result is a numeric data.frame. You'll see, for instance, that the count of "yu" in row 6 is actually 2. To get the logical output you want (just presence/absence), compare the values obtained from mtabulate with zero.
