tm package error "Cannot convert DocumentTermMatrix into normal matrix since vector is too large"

I have created a DocumentTermMatrix that contains 1859 documents (rows) and 25722 terms (columns). In order to perform further calculations on this matrix I need to convert it to a regular matrix. I want to use the as.matrix() command. However, it returns the following error: cannot allocate vector of size 364.8 MB.

> corp
A corpus with 1859 text documents
> mat<-DocumentTermMatrix(corp)
> dim(mat)
[1]  1859 25722
> is(mat)
[1] "DocumentTermMatrix"
> mat2<-as.matrix(mat)
Fehler: kann Vektor der Größe 364.8 MB nicht allozieren # cannot allocate vector of size 364.8 MB
> object.size(mat)
5502000 bytes

For some reason the size of the object seems to increase dramatically whenever it is transformed to a regular matrix. How can I avoid this?

Or is there an alternative way to perform regular matrix operations on a DocumentTermMatrix?


The quick and dirty way is to export your data into a sparse matrix object from an external package like Matrix.

> attributes(dtm)
$names
[1] "i"        "j"        "v"        "nrow"     "ncol"     "dimnames"

$class
[1] "DocumentTermMatrix"    "simple_triplet_matrix"

$Weighting
[1] "term frequency" "tf"            

The dtm object has i, j and v attributes, which are the internal (triplet) representation of your DocumentTermMatrix. Use:

library("Matrix") 
mat <- sparseMatrix(
           i=dtm$i,
           j=dtm$j, 
           x=dtm$v,
           dims=c(dtm$nrow, dtm$ncol)
           )

and you're done.
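If you also want to carry over the document and term names (the snippet above drops them), sparseMatrix() accepts a dimnames argument, so something like this should work, assuming the dimnames stored in your DTM are intact:

mat <- sparseMatrix(
           i        = dtm$i,
           j        = dtm$j,
           x        = dtm$v,
           dims     = c(dtm$nrow, dtm$ncol),
           dimnames = dtm$dimnames   # keeps the Docs/Terms labels, if present
           )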

A naive comparison between your objects:

> mat[1,1:100]
> head(as.vector(dtm[1,]), 100)

will each give you the exact same output.


A DocumentTermMatrix uses a sparse matrix representation, so it doesn't waste memory storing all those zeros. Depending on what you want to do, you might have some luck with the SparseM package, which provides some linear algebra routines that work on sparse matrices.
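To see why the dense conversion blows up while the sparse object stays small: a regular numeric matrix stores every cell as an 8-byte double, zeros included, so for the dimensions in the question:

1859 * 25722 * 8 / 1024^2   # ~364.8 -- the 364.8 MB allocation that as.matrix() failed on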


Are you able to increase the amount of RAM available to R? See this post: Increasing (or decreasing) the memory available to R processes

Also, when working with big objects in R, I occasionally call gc() to free up wasted memory.
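For example (here mat2 just stands in for whatever large intermediate you no longer need):

rm(mat2)   # drop the large object from the workspace
gc()       # trigger garbage collection so the memory is actually reclaimed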


The number of documents should not be a problem, but you may want to try removing sparse terms; this could very well reduce the dimension of the document-term matrix.

inspect(removeSparseTerms(dtm, 0.7))

It removes terms that have a sparsity of at least 0.7, i.e. terms that are missing from at least 70% of the documents.

Another option is to specify a minimum word length and a minimum document frequency when you create the document-term matrix:

a.dtm <- DocumentTermMatrix(a.corpus, control = list(weighting = weightTfIdf, minWordLength = 2, minDocFreq = 5))

Use inspect(dtm) before and after your changes and you will see a huge difference; more importantly, you won't ruin significant relations hidden in your docs and terms.
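A minimal before-and-after check, assuming your matrix is called dtm (the exact number of surviving terms will of course depend on your corpus):

dim(dtm)                                 # e.g. 1859 25722
dtm.small <- removeSparseTerms(dtm, 0.7)
dim(dtm.small)                           # usually far fewer columns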


Since you only have 1859 documents, the distance matrix you need to compute is fairly small. Using the slam package (and in particular, its crossapply_simple_triplet_matrix function), you might be able to compute the distance matrix directly, instead of converting the DTM into a dense matrix first. This means that you will have to compute the Jaccard similarity yourself. I have successfully tried something similar for the cosine distance matrix on a large number of documents.
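A rough sketch of the cosine version of that idea, using slam's tcrossprod_simple_triplet_matrix and staying sparse until the final 1859 x 1859 result (a Jaccard version would follow the same pattern, e.g. via crossapply_simple_triplet_matrix):

library("slam")

dot  <- tcrossprod_simple_triplet_matrix(dtm)   # document-by-document dot products, 1859 x 1859
norm <- sqrt(diag(dot))                         # Euclidean length of each document vector
cos.sim <- dot / (norm %o% norm)                # cosine similarity for every pair of documents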
