tm package error "Cannot convert DocumentTermMatrix into normal matrix since vector is too large"

I have created a DocumentTermMatrix that contains 1859 documents (rows) and 25722 terms (columns). In order to perform further calculations on this matrix I need to convert it to a regular matrix. I want to use the as.matrix() command. However, it returns the following error: cannot allocate vector of size 364.8 MB.

> corp
A corpus with 1859 text documents
> mat<-DocumentTermMatrix(corp)
> dim(mat)
[1]  1859 25722
> is(mat)
[1] "DocumentTermMatrix"
> mat2<-as.matrix(mat)
Fehler: kann Vektor der Größe 364.8 MB nicht allozieren # cannot allocate vector of size 364.8 MB
> object.size(mat)
5502000 bytes

For some reason the size of the object seems to increase dramatically whenever it is transformed to a regular matrix. How can I avoid this?

Or is there an alternative way to perform regular matrix operations on a DocumentTermMatrix?


The quick and dirty way is to export your data into a sparse matrix object from an external package like Matrix.

> attributes(dtm)
$names
[1] "i"        "j"        "v"        "nrow"     "ncol"     "dimnames"

$class
[1] "DocumentTermMatrix"    "simple_triplet_matrix"

$Weighting
[1] "term frequency" "tf"            

The dtm object has i, j and v attributes, which are the internal (triplet) representation of your DocumentTermMatrix. Use:

library("Matrix") 
mat <- sparseMatrix(
           i=dtm$i,
           j=dtm$j, 
           x=dtm$v,
           dims=c(dtm$nrow, dtm$ncol)
           )

and you're done.
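If you also want to carry over the document and term names (the snippet above drops them), sparseMatrix() accepts a dimnames argument, so something like this should work, assuming the dimnames stored in your DTM are intact:

mat <- sparseMatrix(
           i        = dtm$i,
           j        = dtm$j,
           x        = dtm$v,
           dims     = c(dtm$nrow, dtm$ncol),
           dimnames = dtm$dimnames   # keeps the Docs/Terms labels, if present
           )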

A naive comparison between your objects:

> mat[1,1:100]
> head(as.vector(dtm[1,]), 100)

will each give you the exact same output.


A DocumentTermMatrix uses a sparse matrix representation, so it doesn't waste memory storing all those zeros. Depending on what you want to do, you might have some luck with the SparseM package, which provides some linear algebra routines that work on sparse matrices.
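To see why the dense conversion blows up while the sparse object stays small: a regular numeric matrix stores every cell as an 8-byte double, zeros included, so for the dimensions in the question:

1859 * 25722 * 8 / 1024^2   # ~364.8 -- the 364.8 MB allocation that as.matrix() failed on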


Are you able to increase the amount of RAM available to R? See this post: Increasing (or decreasing) the memory available to R processes

Also, when working with big objects in R, I occasionally call gc() to free up wasted memory.
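For example (here mat2 just stands in for whatever large intermediate you no longer need):

rm(mat2)   # drop the large object from the workspace
gc()       # trigger garbage collection so the memory is actually reclaimed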


The number of documents should not be a problem, but you may want to try removing sparse terms; this could very well reduce the dimension of the document-term matrix.

inspect(removeSparseTerms(dtm, 0.7))

It removes terms that have a sparsity of at least 0.7, i.e. terms that are missing from at least 70% of the documents.

Another option is to specify a minimum word length and a minimum document frequency when you create the document-term matrix:

a.dtm <- DocumentTermMatrix(a.corpus, control = list(weighting = weightTfIdf, minWordLength = 2, minDocFreq = 5))

Use inspect(dtm) before and after your changes and you will see a huge difference; more importantly, you won't ruin significant relations hidden in your docs and terms.
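A minimal before-and-after check, assuming your matrix is called dtm (the exact number of surviving terms will of course depend on your corpus):

dim(dtm)                                 # e.g. 1859 25722
dtm.small <- removeSparseTerms(dtm, 0.7)
dim(dtm.small)                           # usually far fewer columns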


Since you only have 1859 documents, the distance matrix you need to compute is fairly small. Using the slam package (and in particular, its crossapply_simple_triplet_matrix function), you might be able to compute the distance matrix directly, instead of converting the DTM into a dense matrix first. This means that you will have to compute the Jaccard similarity yourself. I have successfully tried something similar for the cosine distance matrix on a large number of documents.
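A rough sketch of the cosine version of that idea, using slam's tcrossprod_simple_triplet_matrix and staying sparse until the final 1859 x 1859 result (a Jaccard version would follow the same pattern, e.g. via crossapply_simple_triplet_matrix):

library("slam")

dot  <- tcrossprod_simple_triplet_matrix(dtm)   # document-by-document dot products, 1859 x 1859
norm <- sqrt(diag(dot))                         # Euclidean length of each document vector
cos.sim <- dot / (norm %o% norm)                # cosine similarity for every pair of documents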
