How can I structure and recode messy categorical data in R?

Posted 2022-12-29 18:12

I'm struggling with how best to structure messy categorical data from a dataset that I'll need to clean.

The Coding Scheme

I'm analyzing data from a university science course exam. We're looking at patterns in student responses, and we developed a coding scheme to represent the kinds of things students are doing in their answers. A subset of the coding scheme is shown below.

Note that within each major code (1, 2, 3) are nested non-unique sub-codes (a, b, ...).

What the Raw Data Looks Like

I've created an anonymized, raw subset of my actual data which you can view here. Part of my problem is that those who coded the data noticed that some students displayed multiple patterns. The coders' solution was to create enough columns (reason1, reason2, ...) to hold students with multiple patterns. That becomes important because the order (reason1, reason2) is arbitrary--two students (like student 41 and student 42 in my dataset) who correctly applied "dependency" should both register in an analysis, regardless of whether 3a appears in the reason column or the reason2 column.

How Can I Best Structure Student Data?

Part of my problem is that in the raw data, not all students display the same patterns, or the same number of them, in the same order. Some students may do just one thing, others may do several. So, an abstracted representation of example students might look like this:

[Figure not preserved: abstracted example of students and their reason codes]

Note in the example above that student002 and student003 both are coded as "1b", although I've deliberately shown the order as different to reflect the reality of my data.

My (Practical) Questions

Should I concatenate reason1, reason2, ... into one column?
How can I (re)code the reasons in R to reflect the multiplicity for some students?

Thanks

I realize this question is as much about good data conceptualization as it is about specific features of R, but I thought it would be appropriate to ask it here. If you feel it's inappropriate for me to ask the question, please let me know in the comments, and stackoverflow will automatically flood my inbox with sadface emoticons. If I haven't been specific enough, please let me know and I'll do my best to be clearer.

Make it "long":

library(reshape)
dnow <- read.csv("~/Downloads/catsample20100504.csv")
dnow <- melt(dnow, id.vars=c("Student", "instructor"))
dnow$variable <- NULL ## since ordering does not matter
subset(dnow, Student%in%c(41,42)) ## see the results

What to do next will depend on the kind of analysis you would like to do, but the long format is the useful one for irregular data such as yours.
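As a rough illustration of why the long format helps, here is a minimal sketch in base R (no reshape needed). The data frame is made up, and the column names reason1 and reason2 are assumptions based on the question, not the real file.

```r
# Made-up data in the same wide layout (values and column names are assumptions)
wide <- data.frame(
  Student    = c(41, 42, 43),
  instructor = c("A", "A", "B"),
  reason1    = c("3a", "1b", "2a"),
  reason2    = c("",   "3a", ""),
  stringsAsFactors = FALSE
)

reason_cols <- c("reason1", "reason2")

# Stack the reason columns: one row per (student, reason) pair
long <- data.frame(
  Student    = rep(wide$Student, times = length(reason_cols)),
  instructor = rep(wide$instructor, times = length(reason_cols)),
  value      = unlist(wide[reason_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Drop the empty slots left by students with fewer reasons
long <- long[!is.na(long$value) & long$value != "", ]

# Students coded 3a, regardless of which column the code was entered in
students_3a <- sort(unique(long$Student[long$value == "3a"]))
print(students_3a)
```

Because every reason sits in the same column, a single comparison finds students 41 and 42 even though their 3a codes were entered in different columns.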

You should use ddply from plyr and split on all of the columns if you want to take the different reasons into account; if you want to ignore them, leave those columns out of the split. You'll need to clean up some of the question marks and extra stuff first, though.

library(plyr)  # column names below are placeholders; n stands in for the stats you want
x <- ddply(data, c("split_column1", "split_column3"), summarize,
           n = length(split_column1))
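The same split-then-summarize idea can also be sketched with base R's aggregate instead of ddply. The data here is made up, and the "?" suffixes and stray spaces are only assumed examples of the noise mentioned above; the real file may need different cleaning.

```r
# Made-up long-format data; the "?" and stray spaces are assumed noise,
# not taken from the real file
long <- data.frame(
  Student    = c(41, 42, 42, 43, 44),
  instructor = c("A", "A", "A", "B", "B"),
  value      = c("3a", "1b?", " 3a", "2a", "3a ?"),
  stringsAsFactors = FALSE
)

# Strip question marks and whitespace so "3a", " 3a" and "3a ?" all match
long$value <- gsub("[?[:space:]]", "", long$value)

# Split on instructor and code, then count the rows in each group
counts <- aggregate(Student ~ instructor + value, data = long, FUN = length)
names(counts)[names(counts) == "Student"] <- "n"
print(counts)
```

To ignore the instructor, drop it from the formula (Student ~ value), which mirrors leaving a column out of the ddply split.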

What's the (bigger picture) question you're attempting to answer? Why is this information interesting to you?

Are you just trying to find patterns such as 'if the student does this, then they also likely do this'?
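If co-occurrence is the goal, one possible starting point is a code-by-code count of how many students show each pair of codes. This is only a sketch on made-up long-format data, with column names assumed from the earlier answer.

```r
# Made-up long-format data: one row per (student, code) pair
long <- data.frame(
  Student = c(41, 42, 42, 43, 44, 44),
  value   = c("3a", "1b", "3a", "2a", "1b", "3a"),
  stringsAsFactors = FALSE
)

# Student-by-code indicator matrix: 1 if the student shows the code, else 0
inc <- (unclass(table(long$Student, long$value)) > 0) * 1

# Code-by-code matrix: entry [i, j] is the number of students showing both
# codes; the diagonal is the number of students showing each code at all
co <- t(inc) %*% inc
print(co)
```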

Something I'd consider if that's the case - split the data set into smaller random samples for your analysis to reduce the risk of false positives.
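A minimal sketch of such a split, assuming hypothetical student IDs and two halves (one to explore patterns, one to check them). The labels "explore" and "confirm" and the seed are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only so the split is reproducible

student_ids <- 1:20  # hypothetical student IDs

# Randomly assign each student to one of two equally sized groups
half <- sample(rep(c("explore", "confirm"), length.out = length(student_ids)))
split_ids <- split(student_ids, half)

print(split_ids)
```

Splitting by student rather than by row keeps all of a student's reasons in the same sample once the data is in long format.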

Interesting problem though!
