
Large-scale regression in R with a sparse feature matrix

I'd like to do large-scale regression (linear/logistic) in R with many (e.g., 100k) features, where each example is relatively sparse in the feature space (e.g., ~1k non-zero features per example).

It seems like slm from the SparseM package should do this, but I'm having difficulty converting from the sparseMatrix format to an slm-friendly format.

I have a numeric vector of labels y and a sparseMatrix of features X with entries in {0, 1}. When I try

model <- slm(y ~ X)

I get the following error:

Error in model.frame.default(formula = y ~ X) : 
invalid type (S4) for variable 'X'

presumably because slm wants a SparseM object instead of a sparseMatrix.

Is there an easy way to either a) populate a SparseM object directly or b) convert a sparseMatrix to a SparseM object? Or perhaps there's a better/simpler way to do this?

(I suppose I could explicitly code the solutions for linear regression using X and y, but it would be nice to have slm working.)
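
For concreteness, here is a minimal sketch of that explicit route via the normal equations, assuming X is a dgCMatrix and crossprod(X) is non-singular:

library(Matrix)

# Solve the normal equations (X'X) beta = X'y without densifying:
# crossprod() keeps everything sparse, and solve() dispatches to a
# sparse Cholesky factorization for the symmetric crossprod(X).
beta <- solve(crossprod(X), crossprod(X, y))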


I don't know about SparseM, but the MatrixModels package has an unexported lm.fit.sparse function that you can use; see ?MatrixModels:::lm.fit.sparse. Here is an example:

Create the data:

# Simulate a response and build a sparse indicator matrix from a factor
y <- rnorm(30)
x <- factor(sample(letters, 30, replace=TRUE))
X <- as(x, "sparseMatrix")   # one row per observed level, one column per observation
class(X)
# [1] "dgCMatrix"
# attr(,"package")
# [1] "Matrix"
dim(X)
# [1] 18 30

Run the regression:

MatrixModels:::lm.fit.sparse(t(X), y)   # t(X) puts observations in rows
#  [1] -0.17499968 -0.89293312 -0.43585172  0.17233007 -0.11899582  0.56610302
#  [7]  1.19654666 -1.66783581 -0.28511569 -0.11859264 -0.04037503  0.04826549
# [13] -0.06039113 -0.46127034 -1.22106064 -0.48729092 -0.28524498  1.81681527
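
The coefficients come back as an unnamed vector; a small sketch to label them, assuming the coercion above kept the factor levels as rownames(X):

coefs <- MatrixModels:::lm.fit.sparse(t(X), y)
names(coefs) <- rownames(X)   # assumes rownames(X) holds the factor levels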

For comparison:

lm(y ~ x - 1)

# Call:
# lm(formula = y ~ x - 1)
# 
# Coefficients:
#       xa        xb        xd        xe        xf        xg        xh        xj  
# -0.17500  -0.89293  -0.43585   0.17233  -0.11900   0.56610   1.19655  -1.66784  
#       xm        xq        xr        xt        xu        xv        xw        xx  
# -0.28512  -0.11859  -0.04038   0.04827  -0.06039  -0.46127  -1.22106  -0.48729  
#       xy        xz  
# -0.28524   1.81682  


A belated answer: glmnet also supports sparse matrices, and both of the regression models requested. It can consume the sparse matrices produced by the Matrix package directly, and I advise looking into regularized models via this package. Since sparse data often means very sparse support for some variables, L1 regularization is useful for knocking those out of the model; that is often safer than accepting very spurious parameter estimates for variables with low support.
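
A minimal sketch, assuming X is a dgCMatrix of features and y a numeric response (for the logistic case, pass a 0/1 label vector and family = "binomial"):

library(glmnet)
library(Matrix)

# glmnet accepts the dgCMatrix directly; no densification needed.
# alpha = 1 gives the lasso (pure L1 penalty).
fit <- glmnet(X, y, family = "gaussian", alpha = 1)

# Logistic regression on 0/1 labels y01 (placeholder name):
# fit <- glmnet(X, y01, family = "binomial", alpha = 1)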


glmnet is a good choice. It supports L1 and L2 regularization for linear, logistic, and multinomial regression, among other options.

The only detail is that it doesn't have a formula interface, so you have to create your model matrix yourself. But that is exactly where the gain is: sparse.model.matrix builds it without ever going dense.

Here is a pseudo-example:

library(glmnet)
library(doMC)
registerDoMC(cores=4)   # parallel backend for cv.glmnet

# y_train: your binary response vector; x_df: a data frame of raw
# predictors (placeholder names; substitute your own data)
x_train <- sparse.model.matrix(~ . -1, data=x_df)

# For example, logistic regression using the L1 norm (lasso),
# tuned by 5-fold cross-validation on AUC
cv.fit <- cv.glmnet(x=x_train, y=y_train, family='binomial', alpha=1,
                    type.logistic="modified.Newton", type.measure = "auc",
                    nfolds=5, parallel=TRUE)

plot(cv.fit)
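
Once cross-validation has picked a penalty, the (sparse) coefficient vector can be extracted with coef(); lambda.min is the penalty with the best cross-validated performance:

coef(cv.fit, s = "lambda.min")   # sparse coefficient vector at the chosen penalty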


You might also get some mileage by looking here:

  • The biglm package.
  • The High Performance and Parallel Computing R task view.
  • The paper Sparse Model Matrices for Generalized Linear Models (PDF), by Martin Mächler and Douglas Bates, from useR! 2010.