Problem for lsi

2022-12-17 09:19 问答作者：

I am using Latent semantic analysis for text similarity. I have 2 questions.

How to select K value for dimention reduction?
I read alot every where that LSI work for simi开发者_运维百科lary meaning words for example car and automobile. How is it possible??? What is the magic step I am missing here?

The typical choice for k is 300. Ideally, you set k based on an evaluation metric that uses the reduced vectors. For example, if you're clustering documents, you could select the k that maximizes the clustering solution score. If you don't have a benchmark to measure against, then I would set k based on how big your data set is. If you only have 100 documents, then you wouldn't expect to need several hundred latent factors to represent them. Likewise, if you have a million documents, then 300 may be too small. However, in my experience the resulting vectors are fairly robust to large changes in k, provided that k is not too small (i.e., k = 300 does about as well as k = 1000).
You might be confusing LSI with Latent Semantic Analysis (LSA). They're very related techniques, with the difference being that LSI operates on documents, and LSA operates on words. Both approaches use the same input (a term x document matrix). There are several good open source LSA implementations if you would like to try them. The LSA wikipedia page has a comprehensive list.

try a couple of different values from [1..n] and see what works for whatever task you are trying to accomplish
Make a word-word correlation matrix [ i.e. cell(i,j) holds the # of docs where (i,j) co-occur ] and use something like PCA on it

精彩评论