Discriminant Analysis with R
I was running discriminant analysis using R. The code is the following:
fit <- lda(group~ A+C1_1+C2+D1a_1+D2_1+D3_1+D3_2+D3_3+E1a_1+E1b_1+E1b开发者_StackOverflow中文版_2+E2_1+E3_1+E3_2+E3_3+F2+G_1+G_2+G_3+G_4+H1_1+H2a_1+H2b_1+H3_1+H4_1_1+H1_2+H2a_2+H2b_2+ H3_2+H4_1_2+J1_1+J2_1+J3_1+K1a+K2_1+K2_2+K2_3+K2_4,data=data1)
But unfortunately I was getting the following error:
Error in x - group.means[g, ] : non-conformable arrays
This is the str(data1) output:
'data.frame': 210 obs. of 133 variables:
$ A : int 1 1 1 1 1 1 1 2 1 2 ...
$ C1_1 : int 22 29 12 12 25 15 30 20 30 15 ...
$ C2 : int 2 2 2 2 2 2 2 1 2 2 ...
$ D1a_1 : int 40 50 160 15 150 105 150 45 100 80 ...
$ D2_1 : int 100 100 100 100 100 100 100 90 95 100 ...
$ D3_1 : int 5 15 40 10 30 25 30 40 25 60 ...
$ D3_2 : int 10 30 30 15 30 25 60 40 20 10 ...
$ D3_3 : int 10 30 30 10 10 15 10 20 20 30 ...
$ E1a_1 : int 80 25 140 30 150 120 80 30 100 100 ...
$ E1b_1 : int 100 50 50 25 80 70 80 75 10 75 ...
$ E1b_2 : int 0 50 50 75 20 30 20 25 90 25 ...
$ E2_1 : int 20 60 75 70 60 80 75 100 60 80 ...
$ E3_1 : int 5 20 20 5 30 20 25 25 10 30 ...
$ E3_2 : int 10 20 40 15 30 20 50 50 10 30 ...
$ E3_3 : int 10 20 15 10 10 20 25 25 10 40 ...
$ G_1 : int 5 50 20 25 80 10 30 25 35 5 ...
$ G_2 : int 0 10 50 50 10 10 30 30 30 10 ...
$ G_3 : int 90 30 20 25 10 50 5 30 15 80 ...
$ G_4 : int 5 10 10 0 0 30 35 15 20 5 ...
$ H1_1 : int 1 3 3 2 3 2 3 3 2 3 ...
$ H2a_1 : int NA NA NA 1 NA 2 NA NA 1 NA ...
$ H2b_1 : int NA 2 1 NA 2 NA 1 1 NA 1 ...
$ H3_1 : int 2 2 2 2 2 3 3 3 2 2 ...
$ H4_1_1 : int 6 5 7 6 3 6 5 6 5 5 ...
$ J1_1 : int 4 6 4 4 4 4 6 7 3 3 ...
$ J2_1 : int 2 6 5 3 4 4 1 2 3 3 ...
$ J3_1 : int 4 5 3 3 4 4 6 7 3 4 ...
$ K1a : int 2 2 2 2 2 2 2 2 1 1 ...
$ K2_1 : int NA NA NA NA NA NA NA NA 0 0 ...
$ K2_2 : int NA NA NA NA NA NA NA NA 1 0 ...
$ K2_3 : int NA NA NA NA NA NA NA NA 0 1 ...
$ K2_4 : int NA NA NA NA NA NA NA NA 0 0 ...
[list output truncated
]]
Second, can anybody please tell me how to get the significance level of the variables used in the discriminant analysis.
Works fine with a randomly generated data set with no NA
values:
set.seed(101)
z <- matrix(runif(210*133),nrow=210)
zz <- data.frame(A=sample(1:2,size=210,replace=TRUE),z)
m <- MASS::lda(A~.,data=zz)
I can reproduce the error if I add enough NA
s:
z2 <- z
z2[sample(length(z),size=2000)] <- NA
zz2 <- data.frame(A=sample(1:2,size=210,replace=TRUE),z2)
m <- MASS::lda(A~.,data=zz2)
results in
Error in x - group.means[g, ] : non-conformable arrays
(if I knock out fewer I get warnings about collinearity instead)
For a start, try removing all variables with any NA
values (or those with more than a few) and see if you can get it to work.
For the p value part of the question: googling "+r MASS lda discriminant analysis" leads to http://www.statmethods.net/advstats/discriminant.html and suggests (and provides a link to) MANOVA for these p values.
Based on a little bit of googling, it looks like people usually use MANOVA with Wilks' lambda for tests in the context of LDA: for example, http://userwww.sfsu.edu/~efc/classes/biol710/discrim/discrim.pdf says
Discriminant function analysis is broken into a 2-step process: (1) testing significance of a set of discriminant functions, and; (2) classification. The first step is computationally identical to MANOVA.
They go on to show an example of using Wilks' lambda, although ?manova
says that the Pillai-Bartlett test (which is the default in manova
) may be better ... in any case, it's pretty easy to do the test.
> summary(manova(z~zz$A),test="Wilks")
Df Wilks approx F num Df den Df Pr(>F)
zz$A 1 0.38164 0.92587 133 76 0.6545
Residuals 208
This of course is not exactly what you asked for -- you asked (I think) for the significance level associated with individual variables rather than with the overall test. I can imagine you could do something via appropriately multiplicity-corrected logistic regression, but this is turning into a statistical rather than an R question. If you don't get any further answers here you might consider asking an appropriately reformulated question on http://stats.stackexchange.com , referencing this question ...
Another potential question is because the data set faces the rank deficient problem. That is, some variables in one of the classes are exactly the same.
If we checked with "NA"s and there isn't any, we could also run a quick test:
lda(Y ~ Var1, data = Data.Name)
If it works, then we can find an appropriate subset of variables manually by changing the varlist in the code block below:
# a subset of all of your predictors, I recommend start from some simple ones
varlist <- c(1:5, 8:10)
# get the colnames from the data frame
Col.Name <- colnames(Data.Name)[varlist]
# form a formula
fun <- paste("Response ~", paste(as.character(Col.Name), sep = "=", collapse = " + "))
fun <- formula(fun)
# pass it to the lda function
lda(fun, data = Data.Name)
Those variables that we just excluded should have the same number within at least one of the classes, we can check it by looking back at our raw data.
精彩评论