Assigning a factor to a data frame

I want to add a column to a data frame that encodes the levels of a factor. For example:

subject  rate
1          12
1          10 
1          13
4          4
4          6
4          12
2          9
2          2
2          5
6          17
6          10
6          1

In the data frame above, I wish to add a third column called "treatment", where each subject is assigned to one of two levels, "a" or "b", as shown below:

subject  rate  treatment
1          12      a
1          10      a
1          13      a
4          4       b
4          6       b
4          12      b
2          9       b
2          2       b
2          5       b 
6          17      a
6          10      a
6          1       a  

Thanks in advance for any help.


Here's another approach using the plyr package:

library(plyr)

#Make some fake data
set.seed(1)
dat <- data.frame(subject = rep(c(1,4,2,6), each = 3), rate = sample(1:20, 12, TRUE))

set.seed(1)
#Assign a treatment to each subject ID at random. This does not ensure that you will get
#at least one subject in each treatment group.
#sample() draws one letter per subject; transform recycles it across that subject's rows.
ddply(dat, "subject", transform, treatment = sample(letters[1:2], 1))

EDIT - to address your comment

Given that you want to specify which subject gets assigned to which treatment, Gavin's suggestion of merge is spot on. I would first make a new data.frame that contains one record for each unique subject, assign their treatment, and then merge them together:

treatments <- data.frame(subject = unique(dat$subject), treats = c("a", "b", "b", "a"))
merge(dat, treatments)

Note that the order of unique(dat$subject) is 1, 4, 2, 6, which is the order in which the subjects appear in the original data.frame, so the treatments must be listed in that same order. If your real problem contains more than four subjects, you may want to consider a more automated way of assigning treatment groups. One approach I've used in the past is to assign a random number to each subject, and then assign groups based on a threshold of that random number. It is essentially the same as the approach above, but splitting at the median can give you roughly equal numbers in each group. For example:

dat <- ddply(dat, "subject", transform, treatment = runif(1))
dat <- within(dat, treatment <- ifelse(treatment < quantile(treatment, 0.5),"a", "b"))
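The same idea can be sketched in base R, without plyr, by shuffling a balanced vector of labels across the unique subjects. This is only a minimal sketch: the names subj and labels are made up for the illustration, and the data are the fake data from above.

```r
set.seed(1)
dat <- data.frame(subject = rep(c(1, 4, 2, 6), each = 3),
                  rate = sample(1:20, 12, TRUE))

# One entry per subject, in order of first appearance
subj <- unique(dat$subject)

# Balanced label vector (a, b, a, b, ...), shuffled across the subjects
labels <- sample(rep(c("a", "b"), length.out = length(subj)))

# Look up each row's subject and copy its label
dat$treatment <- labels[match(dat$subject, subj)]

# Number of subjects per treatment
table(labels)
```

With an odd number of subjects, one group ends up with one subject more than the other.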


If you want to assign treatments at random, this will do it:

## subject IDs
subj <- with(dat, unique(subject))

## how many treatment levels?
ntreat <- 2

## sample an identifier for the treatments
set.seed(47)
treats <- sample(letters[seq_len(ntreat)], length(subj), replace = TRUE)

## stick this into a subject/treatment data frame (without cbind(), which
## would coerce the numeric subject IDs to character)
Treat <- data.frame(subject = subj, treatment = treats)

This gives:

R> Treat
  subject treatment
1       1         b
2       4         a
3       2         b
4       6         b

Edit:

If the treatments have been pre-assigned, then just create the Treat data frame by hand:

Treat <- data.frame(subject = c(1,4,2,6), treatment = c("a","b","b","a"))

If you have lots of these to do you can use functions like seq() and rep(), plus the inbuilt letters constant to speed up the "data entry".
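As a rough sketch of that kind of shortcut, assuming a hypothetical layout of 12 subjects numbered 1 to 12 and 4 treatments given to consecutive blocks of subjects (the numbers are invented for the illustration):

```r
# Hypothetical design: 12 subjects, 4 treatments, 3 consecutive subjects each
nsubj <- 12
ntreat <- 4

Treat <- data.frame(
  subject = seq_len(nsubj),
  # letters[1:4] is a, b, c, d; each = 3 repeats every letter three times in a row
  treatment = rep(letters[seq_len(ntreat)], each = nsubj / ntreat)
)
Treat
```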

End edit

We can now merge this data frame with the original data, using merge(), to insert the treatment for each subject:

R> merge(dat, Treat)
   subject rate treatment
1        1   12         b
2        1   10         b
3        1   13         b
4        2    9         b
5        2    2         b
6        2    5         b
7        4    4         a
8        4    6         a
9        4   12         a
10       6   17         b
11       6   10         b
12       6    1         b
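Note that the merged result is sorted by subject, because merge() sorts on the key columns by default. If the original row order matters, one alternative is a lookup with match(). The following is only a sketch, using the data from the question and the hand-made Treat table; idx is a name made up for the illustration.

```r
# Data from the question, plus the hand-made Treat table
dat <- data.frame(subject = rep(c(1, 4, 2, 6), each = 3),
                  rate = c(12, 10, 13, 4, 6, 12, 9, 2, 5, 17, 10, 1))
Treat <- data.frame(subject = c(1, 4, 2, 6),
                    treatment = c("a", "b", "b", "a"))

# For each row of dat, find the row of Treat holding the same subject
idx <- match(dat$subject, Treat$subject)

# Copy the treatment across; the rows of dat stay in their original order
dat$treatment <- factor(Treat$treatment[idx])
dat
```

A subject that is missing from Treat gets NA here, whereas merge() would drop its rows by default.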


I will assume you have some key for mapping subjects to treatments, for instance 1, 6 => a and 4, 2 => b. Then a mix of ifelse and %in% should do the job:

df$treatment <- factor(ifelse(df$subject %in% c('1', '6'), 'a', 'b'))

A more general option is to copy the factor and alter its levels, but the details depend on how your dictionary is stored. A simple example:

# Assuming the default sorted level order (1, 2, 4, 6), the new labels follow that order
x <- df$subject; levels(x) <- c('a', 'b', 'b', 'a')
df$treatment <- x

(In both examples I assume that subject is a factor)
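If the dictionary is stored as a named vector, a lookup by name is another option. This is a minimal sketch under that assumption: the name key and the toy df are made up for the illustration, with the mapping taken from the question.

```r
# Toy data in the shape of the question, with subject stored as a factor
df <- data.frame(subject = factor(rep(c(1, 4, 2, 6), each = 3)),
                 rate = c(12, 10, 13, 4, 6, 12, 9, 2, 5, 17, 10, 1))

# Dictionary: names are subject IDs, values are treatments
key <- c("1" = "a", "4" = "b", "2" = "b", "6" = "a")

# Index by name; as.character() avoids indexing by the factor's integer codes
df$treatment <- factor(unname(key[as.character(df$subject)]))
df
```

Because of the as.character() call, the same lookup also works when subject is numeric.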


Another approach is to write a small function that decides the treatment for a given subject, and then apply it to the subject column to create a new treatment column.

Here is the code:

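# Subjects cycle 1, 2, 4, 6, 1, 2, ... over 16 rows
data <- data.frame(subject = rep(c(1, 2, 4, 6), times = 4), rate = sample(1:20, 16, TRUE))

# Named assign_treatment rather than cat, which would mask base::cat()
assign_treatment <- function(x) {
  if (x == 1 || x == 4) {
    "a"
  } else if (x == 2 || x == 6) {
    "b"
  } else {
    NA_character_  # unknown subject
  }
}

# vapply returns a character vector; lapply would create a list column
data$treat <- vapply(data$subject, assign_treatment, character(1))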

head(data)

Output:

> head(data)
  subject rate treat
1       1   15     a
2       2   20     b
3       4    8     a
4       6   16     b
5       1   19     a
6       2    5     b