Performing an SVD on tweets: memory problem
EDIT: The size of the wordlist is 10-20 times bigger than I wrote down. I simply forgot a zero.
EDIT2: I will have a look into SVDLIBC and also see how to reduce a matrix to its dense version so that might help too.
I have generated a huge CSV file as output from my POS tagging and stemming. It looks like this:
word1, word2, word3, ..., word 150.000
person1 1 2 0 1
person2 0 0 1 0
...
person650
It contains the word counts for each person, which gives me a characteristic vector for each person.
I want to run an SVD on this beast, but it seems the matrix is too big to be held in memory to perform the operation. My question is:
Should I reduce the number of columns by removing words that have a column sum of, for example, 1, which means that they have been used only once? Do I bias the data too much with this approach?
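To make the pruning idea concrete, here is a minimal sketch of dropping columns whose total count falls below a threshold. It assumes the counts are already in memory as a NumPy array or SciPy sparse matrix; the helper name `drop_rare_columns` and the `min_total` parameter are made up for illustration, and the toy matrix is not real data.

```python
import numpy as np
from scipy import sparse


def drop_rare_columns(counts, min_total=2):
    """Keep only the columns (words) whose summed count is >= min_total.

    Hypothetical helper: returns the pruned matrix and the kept column indices.
    """
    counts = sparse.csc_matrix(counts)                  # CSC makes column slicing cheap
    col_sums = np.asarray(counts.sum(axis=0)).ravel()   # total count per word
    keep = np.flatnonzero(col_sums >= min_total)        # indices of words used often enough
    return counts[:, keep], keep


# Toy stand-in for the person-by-word table (3 people, 4 words).
toy = np.array([[1, 2, 0, 1],
                [0, 0, 1, 0],
                [0, 3, 0, 1]])
pruned, kept = drop_rare_columns(toy, min_total=2)
```

Returning the kept indices lets the remaining columns be mapped back to the original wordlist afterwards.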
I tried the RapidMiner approach of loading the CSV into the database and then reading it in sequentially in batches for processing, as RapidMiner proposes. But MySQL can't store that many columns in a table. If I transpose the data and then re-transpose it on import, it also takes ages....
--> So in general I am asking for advice on how to perform an SVD on such a corpus.
As a dense matrix this is big. However, as a sparse matrix it is only a small one.
Using a sparse matrix SVD algorithm is enough, e.g. here.
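As one possible illustration of a sparse SVD (not necessarily the algorithm linked above), here is a sketch using SciPy's `scipy.sparse.linalg.svds`, which computes only the top `k` singular triplets and never forms the dense matrix. The matrix below is random stand-in data with the dimensions from the question; the density and `k = 20` are assumptions chosen purely for the example.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

n_people, n_words, k = 650, 150_000, 20   # k is an arbitrary choice for this sketch

# Random sparse stand-in for the real person-by-word counts (assumed ~0.1% nonzero).
# Real counts would be loaded into a CSR matrix and cast to float before calling svds.
counts = sparse.random(n_people, n_words, density=0.001,
                       format="csr", random_state=0, dtype=np.float64)

# Truncated SVD: only the k largest singular values and vectors are computed.
u, s, vt = svds(counts, k=k)

# svds does not promise a particular ordering, so sort into descending order.
order = np.argsort(s)[::-1]
u, s, vt = u[:, order], s[order], vt[order, :]

# Each person's coordinates in the k-dimensional latent space.
person_vectors = u * s
```

With only the nonzero entries stored, memory use scales with the number of (person, word) pairs that actually occur rather than with 650 x 150,000.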
SVD is constrained by your memory size. See:
Folding In: a paper on partial matrix updates.
Apache Mahout is a distributed data mining library that runs on Hadoop and has a parallel SVD.