Specifying formula in R with glm without explicit declaration of each covariate
I would like to force specific variables into glm regressions without fully specifying each one. My real data set has ~200 variables. I haven't been able to find samples of this in my online searching thus far.
For example (with just 3 variables):
n=200
set.seed(39)
samp = data.frame(W1 = runif(n, min = 0, max = 1), W2=runif(n, min = 0, max = 5))
samp = transform(samp, # add A
A = rbinom(n, 1, 1/(1+exp(-(W1^2-4*W1+1)))))
samp = transform(samp, # add Y
Y = rbinom(n, 1,1/(1+exp(-(A-sin(W1^2)+sin(W2^2)*A+10*log(W1)*A+15*log(W2)-1+rnorm(1,mean=0,sd=.25))))))
If I want to include all main terms, this has an easy shortcut:
glm(Y~., family=binomial, data=samp)
But say I want to include all main terms (W1, W2, and A) plus W2^2:
glm(Y~A+W1+W2+I(W2^2), family=binomial, data=samp)
Is there a shortcut for this?
[editing self before publishing:] This works! glm(formula = Y ~ . + I(W2^2), family = binomial, data = samp)
Okay, so what about this one!
I want to omit one main term (W1) and include only two main terms (A, W2) plus W2^2 and the W2^2:A interaction:
glm(Y~A+W2+A*I(W2^2), family=binomial, data=samp)
Obviously with just a few variables no shortcut is really needed, but I work with high dimensional data. The current data set has "only" 200 variables, but some others have thousands and thousands.
Your creative use of . to build a formula containing all or almost all variables is a good and clean approach. Another option that is sometimes useful is to build the formula programmatically as a string and then convert it to a formula using as.formula:
vars <- paste("Var",1:10,sep="")
fla <- paste("y ~", paste(vars, collapse="+"))
as.formula(fla)
Of course, you can make the fla object far more complicated.
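As an alternative to pasting a string and calling as.formula, base R's reformulate() builds the formula directly from a character vector of terms. The following is only a minimal sketch applied to the model from the question; the toy samp data frame is an assumption that reuses the question's column names with arbitrary values.

```r
# Toy stand-in for the question's data (same column names, arbitrary values)
set.seed(1)
samp <- data.frame(W1 = runif(50), W2 = runif(50, 0, 5),
                   A = rbinom(50, 1, 0.5), Y = rbinom(50, 1, 0.5))

# Terms to force into the model, kept as plain character strings
rhs <- c("A", "W2", "I(W2^2)", "A:I(W2^2)")

# reformulate() assembles the strings into a formula with Y as the response
fla <- reformulate(rhs, response = "Y")
print(fla)

# The resulting formula can be passed to glm like any hand-written one
fit <- glm(fla, family = binomial, data = samp)
print(names(coef(fit)))
```

Because the terms live in an ordinary character vector, they can be filtered, combined, or generated with the usual vector tools before the formula is built.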
Aniko answered your question. To extend a bit :
You can also exclude variables using - :
glm(Y~.-W1+A*I(W2^2), family=binomial, data=samp)
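To see which terms a formula like this expands to without fitting anything, one option is to inspect the term labels returned by terms(). This is a sketch on a toy samp data frame (an assumption: same column names as in the question, arbitrary values).

```r
# Toy stand-in for the question's data (same column names, arbitrary values)
set.seed(1)
samp <- data.frame(W1 = runif(20), W2 = runif(20, 0, 5),
                   A = rbinom(20, 1, 0.5), Y = rbinom(20, 1, 0.5))

# "." stands for every column except the response; "- W1" then drops W1
f <- Y ~ . - W1 + A * I(W2^2)

# terms() needs the data to know what "." refers to
term_labels <- attr(terms(f, data = samp), "term.labels")
print(term_labels)
```

Checking the labels this way is cheap even with thousands of columns, since no model is estimated.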
For large groups of variables, I often make a data frame for grouping the variables, which allows you to do something like:
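# one row per column of samp (W1, W2, A, Y); each logical column flags a group
vars <- data.frame(
names = names(samp),
main = c(TRUE,FALSE,TRUE,FALSE),
quadratic = c(FALSE,TRUE,TRUE,FALSE),
main2 = c(TRUE,TRUE,FALSE,FALSE),
stringsAsFactors = FALSE
)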
regform <- paste(
"Y ~",
paste(
paste(vars[vars$main,1],collapse="+"),
paste(vars[1,1],paste("*I(",vars[vars$quadratic,1],"^2)"),collapse="+"),
sep="+"
)
)
> regform
[1] "Y ~ W1+A+W1 *I( W2 ^2)+W1 *I( A ^2)"
> glm(as.formula(regform),data=samp,family=binomial)
Filling the data frame from all kinds of conditions (on name, on structure, whatever) lets me quickly select groups of variables in large datasets.
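A condition on the variable names can be sketched with grep(). The example below is illustrative only: the wide data frame, its W1..W5 naming scheme, and the choice of which variables get squared terms are all assumptions, not part of the original data.

```r
# Hypothetical wide data: covariates W1..W5, treatment A, outcome Y
set.seed(1)
n <- 100
wide <- as.data.frame(matrix(runif(n * 5), ncol = 5,
                             dimnames = list(NULL, paste0("W", 1:5))))
wide$A <- rbinom(n, 1, 0.5)
wide$Y <- rbinom(n, 1, 0.5)

# Select groups of variables by name pattern instead of listing them by hand
covars    <- grep("^W[0-9]+$", names(wide), value = TRUE)
main      <- c("A", setdiff(covars, "W1"))       # all main terms except W1
quadratic <- paste0("I(", c("W2", "W3"), "^2)")  # squared terms for a chosen subset

# Combine the groups into one formula and fit
regform <- reformulate(c(main, quadratic), response = "Y")
print(regform)
fit <- glm(regform, family = binomial, data = wide)
```

The same pattern scales to thousands of columns, since only the selection rules need to be written out, not the variables themselves.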