开发者

R: Stacking Multiple Punch Question Data

Suppose we have 2 questions in a survey, one is about how likely an individual is to recommend a company (let's say there's 2 companies for simplicity).

So, I have one data.frame with 2 columns for this question:

df.recommend <- data.frame(rep(1:5,20),rep(1:5,20))
colnames(df.recommend) <- c("Company1","Company2")

And, suppose we have another question that asks respondents to checkmark a box beside an attribute that they believe "fits" with the company.

So, I have another data.frame with 4 columns for this question:

df.attribute <- data.frame(rep(0:1,50),rep(1:0,50),rep(0:1,50),rep(1:0,50))

colnames(df.attribute) <- c(
"Attribute1.Company1", 
"Attribute2.Company1", 
"Attribute1.Company2", 
"Attribute2.Company2") 

Now, what I would like to be able to do is review how Attributes 1 and 2 are related to the scale in the likelyhood to recommend question, for all companies (company independent). Just to get an idea of what inertia lies between those people that are highly likely to recommend and attribute 1 for example.

So, I start off by binding the two questions together:

df <- cbind(df.recommend, df.attribute)

My problem is trying to figure out how to stack these data such that the columns look something like:

df.stacked <- data.frame(c(df$Company1,df$Company2),
c(df$Attribute1.Company1,df$Attribute1.Company2), 
c(df$Attribute2.Company1,df$Attribute2.Company2))
colnames(df.stacked) <- c("Likelihood","Attribute1","Attribute2")

This example is simplified to a large degree. In my actual problem, I have 34 companies and 24 attributes.

Could you think of a way to stack them effectively, without having to type out all the c() statements?

Note: The column pattern for likelyhood is Co1,Co2,Co开发者_运维技巧3,Co4... and the pattern for the attributes is At1.Co1,At2.Co1,At3.Co1 ... At1.Co34,At2.Co34...


For this type of problem, Hadley's reshape package is the perfect tool. I combine it with a few stringr and plyr statements (also packages written by Hadley).

Here is what I believe to be a complete solution in about a dozen lines of code.

First, create some data

library(reshape2) # EDIT 1: reshape2 is faster
library(stringr)
library(plyr)

# Create data frame
# Important: note the addition of a respondent id column

df_comp <- data.frame(
        RespID = 1:10,
        Company1 = rep(1:5, 2),
        Company2 = rep(1:5, 2)
)

df_attr <- data.frame(
        RespID = 1:10,
        Attribute1.Company1 = rep(0:1,5),
        Attribute2.Company1 = rep(1:0,5),
        Attribute1.Company2 = rep(0:1,5),
        Attribute2.Company2 = rep(1:0,5)
)

Now start the data manipulation:

# Use melt to convert data from wide to tall

melt_comp <- melt(df_comp, id.vars="RespID")
melt_comp <- rename(melt_comp, c(variable="comp", value="likelihood"))
melt_attr <- melt(df_attr, id.vars="RespID")

# Use str_split to split attribute variables into attribute and company
# "." period needs to be escaped

# EDIT 2:  reshape::colsplit is simpler than str_split
split <- colsplit(melt_attr$variable, "\\.", names=c("attr", "comp")) 
melt_attr <- data.frame(melt_attr, split)
melt_attr$variable <- NULL

# Use cast to convert from tall to somewhat tall

cast_attr <- cast(melt_attr, RespID + comp ~ attr, mean)


# Combine data frames using join() in package plyr

df <- join(melt_comp, cast_attr)
head(df)

And the output:

  RespID     comp likelihood Attribute1 Attribute2
1      1 Company1          1          0          1
2      2 Company1          2          1          0
3      3 Company1          3          0          1
4      4 Company1          4          1          0
5      5 Company1          5          0          1
6      6 Company1          1          1          0


Something I quickly cooked up. Doesn't look the best and uses a for-loop but that shouldn't be a problem with only 24 values

df.recommend <- data.frame(rep(1:5,20),rep(1:5,20))
colnames(df.recommend) <- c("Co1","Co2")

df.attribute <- data.frame(rep(0:1,50),rep(1:0,50),rep(0:1,50),rep(1:0,50))

colnames(df.attribute) <- c(
"At1.Co1", 
"At2.Co1", 
"At1.Co2", 
"At2.Co2") 


df.stacked <- data.frame(
    likelihood <- unlist(df.recommend)
    )
str <- strsplit(names(df.attribute),split="\\.")
atts <- unique(sapply(str,function(x)x[1]))

for (i in 1:length(atts)) 
{
    df.stacked[,i+1] <- unlist(df.attribute[sapply(str,function(x)x[1]==atts[i])])
}

names(df.stacked) <- c("likelihood",paste("attribute",1:length(atts),sep=""))

EDIT: It assumes that companies are in the same order for each attribute

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜