Transforming Data Frame in R

2023-02-21 01:56 问答作者：

I开发者_运维问答 have a data frame with multiple variables which in turn have multiple categories. I'll like to take each category and convert them to indicator variables.

V1 V2 V3 V4
xc ab ty ky
xc ab ty kj
xc yi tf kj
cv yi tf kj
bg yt tg kl
bg yu yu kl

convert to

xc cv bg .....
T  F  F......
T  F  F....
T  F  F....
F  T  F....
F  F  T...
F  F  T....

i tried

newframe <- transform(oldframe, xc = to_column(oldframe$V1,'xc'))

where to column is

to_column = function(col, val){
    if (col == val)
        'TRUE'  else
        'FALSE' }

This is one standard approach to creating dummy varaibles from a categorical variable:

model.matrix( ~ V1 - 1, data=df)

df is your data.frame as shown in your question. This returns 0/1 binary as your FALSE/TRUE. Hope that helps!

Best regards,

Jay

Building on @Jay's answer, we have this as a logical matrix.

Logical matrix version:

out <- model.matrix( ~ V1 - 1, data=dat)
out <- matrix(as.logical(out), ncol = ncol(out))
colnames(out) <- with(dat, levels(V1))

> out
        bg    cv    xc
[1,] FALSE FALSE  TRUE
[2,] FALSE FALSE  TRUE
[3,] FALSE FALSE  TRUE
[4,] FALSE  TRUE FALSE
[5,]  TRUE FALSE FALSE
[6,]  TRUE FALSE FALSE

All variables at once version:

out2 <- sapply(dat, function(x) model.matrix( ~ x - 1))
out2 <- do.call(cbind, out2)
out2 <- matrix(as.logical(out2), ncol = ncol(out2))
colnames(out2) <- unlist(sapply(dat, levels))

> out2
        bg    cv    xc    ab    yi    yt    yu    tf    tg    ty
[1,] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
[2,] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
[3,] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[4,] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[5,]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
[6,]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
        yu    kj    kl    ky
[1,] FALSE FALSE FALSE  TRUE
[2,] FALSE  TRUE FALSE FALSE
[3,] FALSE  TRUE FALSE FALSE
[4,] FALSE  TRUE FALSE FALSE
[5,] FALSE FALSE  TRUE FALSE
[6,]  TRUE FALSE  TRUE FALSE

If you don't want this as a full matrix like above, then you can stop with the first line, which has all the model matrices in a list, one for each variable (column) in dat, and convert the to a logical. This one-liner does both steps:

> lapply(lapply(dat, function(x) model.matrix( ~ x - 1)),
+        function(x) matrix(as.logical(x), ncol = ncol(x)))
$V1
      [,1]  [,2]  [,3]
[1,] FALSE FALSE  TRUE
[2,] FALSE FALSE  TRUE
[3,] FALSE FALSE  TRUE
[4,] FALSE  TRUE FALSE
[5,]  TRUE FALSE FALSE
[6,]  TRUE FALSE FALSE

$V2
      [,1]  [,2]  [,3]  [,4]
[1,]  TRUE FALSE FALSE FALSE
[2,]  TRUE FALSE FALSE FALSE
[3,] FALSE  TRUE FALSE FALSE
[4,] FALSE  TRUE FALSE FALSE
[5,] FALSE FALSE  TRUE FALSE
[6,] FALSE FALSE FALSE  TRUE

$V3
      [,1]  [,2]  [,3]  [,4]
[1,] FALSE FALSE  TRUE FALSE
[2,] FALSE FALSE  TRUE FALSE
[3,]  TRUE FALSE FALSE FALSE
[4,]  TRUE FALSE FALSE FALSE
[5,] FALSE  TRUE FALSE FALSE
[6,] FALSE FALSE FALSE  TRUE

$V4
      [,1]  [,2]  [,3]
[1,] FALSE FALSE  TRUE
[2,]  TRUE FALSE FALSE
[3,]  TRUE FALSE FALSE
[4,]  TRUE FALSE FALSE
[5,] FALSE  TRUE FALSE
[6,] FALSE  TRUE FALSE

And if the variable names are important, then we can modify this to

foo <- function(x) {
    mat <- matrix(as.logical(x), ncol = ncol(x))
    colnames(mat) <- levels(x)
    mat
}
lapply(lapply(dat, function(x) model.matrix( ~ x - 1)), foo)

You could have a look at the reshape package, it provides functionality to pivot data like this. There are examples of its use at the author's homepage

This is quite straightforward with mtabulate from the "qdap" package:

library(qdap)
mtabulate(split(mydf, 1:nrow(mydf))) > 0
#      ab    bg    cv    kj    kl    ky    tf    tg    ty    xc    yi
# 1  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
# 2  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
# 3 FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
# 4 FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
# 5 FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
# 6 FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#      yt    yu
# 1 FALSE FALSE
# 2 FALSE FALSE
# 3 FALSE FALSE
# 4 FALSE FALSE
# 5  TRUE FALSE
# 6 FALSE  TRUE

By default, mtabulate would tabulate the results (surprise!) so the result would be a numeric data.frame. You'll see, for instance, that the count of "yu" in row 6 is actually 2. To get the logical output you desire (just presence/absence), just compare the values obtained from mtabulate with zero.

继续阅读：r

Transforming Data Frame in R

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？