
Discriminant Analysis with R

I was running discriminant analysis using R. The code is the following:

fit <- lda(group~ A+C1_1+C2+D1a_1+D2_1+D3_1+D3_2+D3_3+E1a_1+E1b_1+E1b_2+E2_1+E3_1+E3_2+E3_3+F2+G_1+G_2+G_3+G_4+H1_1+H2a_1+H2b_1+H3_1+H4_1_1+H1_2+H2a_2+H2b_2+H3_2+H4_1_2+J1_1+J2_1+J3_1+K1a+K2_1+K2_2+K2_3+K2_4, data=data1)

But unfortunately I was getting the following error:

Error in x - group.means[g, ] : non-conformable arrays

This is the str(data1) output:

'data.frame':   210 obs. of  133 variables:
 $ A               : int  1 1 1 1 1 1 1 2 1 2 ...
 $ C1_1            : int  22 29 12 12 25 15 30 20 30 15 ...
 $ C2              : int  2 2 2 2 2 2 2 1 2 2 ...
 $ D1a_1           : int  40 50 160 15 150 105 150 45 100 80 ...
 $ D2_1            : int  100 100 100 100 100 100 100 90 95 100 ...
 $ D3_1            : int  5 15 40 10 30 25 30 40 25 60 ...
 $ D3_2            : int  10 30 30 15 30 25 60 40 20 10 ...
 $ D3_3            : int  10 30 30 10 10 15 10 20 20 30 ...
 $ E1a_1           : int  80 25 140 30 150 120 80 30 100 100 ...
 $ E1b_1           : int  100 50 50 25 80 70 80 75 10 75 ...
 $ E1b_2           : int  0 50 50 75 20 30 20 25 90 25 ...
 $ E2_1            : int  20 60 75 70 60 80 75 100 60 80 ...
 $ E3_1            : int  5 20 20 5 30 20 25 25 10 30 ...
 $ E3_2            : int  10 20 40 15 30 20 50 50 10 30 ...
 $ E3_3            : int  10 20 15 10 10 20 25 25 10 40 ...
 $ G_1             : int  5 50 20 25 80 10 30 25 35 5 ...
 $ G_2             : int  0 10 50 50 10 10 30 30 30 10 ...
 $ G_3             : int  90 30 20 25 10 50 5 30 15 80 ...
 $ G_4             : int  5 10 10 0 0 30 35 15 20 5 ...
 $ H1_1            : int  1 3 3 2 3 2 3 3 2 3 ...
 $ H2a_1           : int  NA NA NA 1 NA 2 NA NA 1 NA ...
 $ H2b_1           : int  NA 2 1 NA 2 NA 1 1 NA 1 ...
 $ H3_1            : int  2 2 2 2 2 3 3 3 2 2 ...
 $ H4_1_1          : int  6 5 7 6 3 6 5 6 5 5 ...
 $ J1_1            : int  4 6 4 4 4 4 6 7 3 3 ...
 $ J2_1            : int  2 6 5 3 4 4 1 2 3 3 ...
 $ J3_1            : int  4 5 3 3 4 4 6 7 3 4 ...
 $ K1a             : int  2 2 2 2 2 2 2 2 1 1 ...
 $ K2_1            : int  NA NA NA NA NA NA NA NA 0 0 ...
 $ K2_2            : int  NA NA NA NA NA NA NA NA 1 0 ...
 $ K2_3            : int  NA NA NA NA NA NA NA NA 0 1 ...
 $ K2_4            : int  NA NA NA NA NA NA NA NA 0 0 ...
  [list output truncated]

Second, can anybody please tell me how to get the significance levels of the variables used in the discriminant analysis?


Works fine with a randomly generated data set with no NA values:

set.seed(101)
z <- matrix(runif(210*133),nrow=210)
zz <- data.frame(A=sample(1:2,size=210,replace=TRUE),z)
m <- MASS::lda(A~.,data=zz)

I can reproduce the error if I add enough NAs:

z2 <- z
z2[sample(length(z),size=2000)] <- NA
zz2 <- data.frame(A=sample(1:2,size=210,replace=TRUE),z2)
m <- MASS::lda(A~.,data=zz2)

results in

Error in x - group.means[g, ] : non-conformable arrays

(if I knock out fewer I get warnings about collinearity instead)

For a start, try removing all variables with any NA values (or those with more than a few) and see if you can get it to work.
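For example, here is a minimal sketch of that step, assuming the data frame data1 and grouping variable group from the question; it simply drops every column that contains any NA and refits on whatever predictors remain (adjust the formula if you want to keep your original predictor list):

keep <- colSums(is.na(data1)) == 0   # columns with no missing values
data1_complete <- data1[, keep]
fit <- MASS::lda(group ~ ., data = data1_complete)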

For the p-value part of the question: googling "+r MASS lda discriminant analysis" leads to http://www.statmethods.net/advstats/discriminant.html, which suggests (and links to) using MANOVA for these p-values.

Based on a little bit of googling, it looks like people usually use MANOVA with Wilks' lambda for tests in the context of LDA: for example, http://userwww.sfsu.edu/~efc/classes/biol710/discrim/discrim.pdf says

Discriminant function analysis is broken into a 2-step process: (1) testing significance of a set of discriminant functions, and; (2) classification. The first step is computationally identical to MANOVA.

They go on to show an example of using Wilks' lambda, although ?manova says that the Pillai-Bartlett test (which is the default in manova) may be better ... in any case, it's pretty easy to do the test.

> summary(manova(z~zz$A),test="Wilks")
           Df   Wilks approx F num Df den Df Pr(>F)
zz$A        1 0.38164  0.92587    133     76 0.6545
Residuals 208     

This of course is not exactly what you asked for -- you asked (I think) for the significance level associated with individual variables rather than with the overall test. I can imagine you could do something via appropriately multiplicity-corrected logistic regression, but this is turning into a statistical rather than an R question. If you don't get any further answers here you might consider asking an appropriately reformulated question on http://stats.stackexchange.com , referencing this question ...
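If you do want something per-variable anyway, here is a rough, hedged sketch (illustration only, not part of the lda machinery): fit one univariate logistic regression per predictor against the simulated two-level grouping variable in the zz data frame above, then correct the p-values for multiplicity:

# one logistic regression per predictor, then Holm correction of the p-values
pvals <- sapply(names(zz)[-1], function(v) {
  f <- reformulate(v, response = "A")
  m <- glm(f, data = transform(zz, A = factor(A)), family = binomial)
  coef(summary(m))[2, "Pr(>|z|)"]
})
p.adjust(pvals, method = "holm")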


Another potential cause is that the data set is rank deficient, i.e. some variables take exactly the same value for every observation within one of the classes.

If we have checked for NAs and there aren't any, we can also run a quick test with a single predictor:

lda(Y ~ Var1, data = Data.Name)

If it works, then we can find an appropriate subset of variables manually by changing the varlist in the code block below:

# a subset of all of your predictors; I recommend starting with some simple ones
varlist <- c(1:5, 8:10)
# get the column names from the data frame
Col.Name <- colnames(Data.Name)[varlist]
# build a formula such as "Y ~ Var1 + Var2 + ..."
fun <- formula(paste("Y ~", paste(Col.Name, collapse = " + ")))
# pass it to the lda function
lda(fun, data = Data.Name)

The variables we had to exclude are likely those that take the same value within at least one of the classes; we can confirm this by looking back at the raw data.
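As a hedged sketch of that check (assuming the grouping variable is called Y in Data.Name, as in the quick test above), flag the predictors that take a single value within at least one class:

# TRUE for predictors that are constant within at least one class
degenerate <- sapply(Data.Name[, setdiff(colnames(Data.Name), "Y")], function(x) {
  any(tapply(x, Data.Name$Y, function(v) length(unique(na.omit(v))) <= 1))
})
names(which(degenerate))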
