
Probabilistic latent semantic analysis/Indexing - Introduction

But recently I found this link quite helpful for understanding the principles of LSA without too much math: http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html. It forms a good basis on which I can build further.

Currently, I'm looking for a similar introduction to Probabilistic Latent Semantic Analysis/Indexing: less math and more examples explaining the principles behind it. If you know of such an introduction, please let me know.

Can it be used to find the measure of similarity between sentences? Does it handle polysemy?

Is there a Python implementation?

Thank you.


There is a good talk by Thomas Hofmann that explains both LSA and its connections to Probabilistic Latent Semantic Analysis (PLSA). The talk has some math, but is much easier to follow than the PLSA paper (or even its Wikipedia page).

PLSA can be used to get a similarity measure between sentences, since two sentences can be viewed as short documents drawn from a probability distribution over latent classes. The similarity will depend heavily on your training set, though: the documents you use to train the latent class model should reflect the types of documents you want to compare. Generating a PLSA model from just two sentences won't create meaningful latent classes, and training on a corpus of very similar contexts may create latent classes that are overly sensitive to slight changes in the documents. Moreover, because sentences contain relatively few tokens (compared to documents), I don't believe you'll get high-quality similarity results from PLSA at the sentence level.
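If it helps to see the idea in code, here's a minimal sketch of sentence similarity via inferred topic distributions. Since I'm not aware of a standard, maintained PLSA package in Python, it uses gensim's LdaModel as a stand-in (LDA is the closely related model discussed below); the training corpus, topic count, and test sentences are purely illustrative.

    # Minimal sketch: sentence similarity via topic distributions.
    # gensim's LdaModel stands in for PLSA here; all data and parameters
    # below are toy values, not a recommendation.
    from gensim import corpora, models

    # Train on documents that reflect the kinds of text you want to compare.
    training_docs = [
        "the cat sat on the mat".split(),
        "dogs and cats make good pets".split(),
        "the stock market fell sharply today".split(),
        "investors worry about the stock market".split(),
    ]
    dictionary = corpora.Dictionary(training_docs)
    corpus = [dictionary.doc2bow(doc) for doc in training_docs]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)

    def topic_vector(sentence, num_topics=2):
        """Infer a dense topic distribution for a single sentence."""
        bow = dictionary.doc2bow(sentence.lower().split())
        dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
        return [dist.get(t, 0.0) for t in range(num_topics)]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5
        return dot / norm if norm else 0.0

    print(cosine(topic_vector("my cat chased the dog"),
                 topic_vector("the market worried investors")))

With a real training corpus you would compare the inferred distributions the same way; cosine is used here for simplicity, but distances designed for probability distributions (e.g. Hellinger) are also common.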

PLSA does not handle polysemy. However, if you are concerned with polysemy, you might try running a Word Sense Disambiguation tool over your input text to tag each word with its correct sense. Running PLSA (or LDA) over this tagged corpus will remove the effects of polysemy in the resulting document representations.
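As a rough illustration of the tagging step, here is a sketch using NLTK's simplified Lesk algorithm as the WSD tool. Lesk is only a weak baseline, so treat this as a starting point rather than a recommendation; it needs the NLTK data packages noted in the comments.

    # Minimal sketch: sense-tag tokens before topic modeling, using NLTK's
    # simplified Lesk WSD. Requires: pip install nltk, plus
    # nltk.download('punkt') and nltk.download('wordnet').
    from nltk.tokenize import word_tokenize
    from nltk.wsd import lesk

    def sense_tag(sentence):
        """Replace each token with a WordNet synset name where Lesk finds one."""
        tokens = word_tokenize(sentence.lower())
        tagged = []
        for tok in tokens:
            synset = lesk(tokens, tok)  # None for words WordNet doesn't cover
            tagged.append(synset.name() if synset else tok)
        return tagged

    print(sense_tag("I sat on the river bank"))
    print(sense_tag("I deposited money at the bank"))
    # If Lesk distinguishes the two senses, "bank" maps to different synset
    # names, so PLSA/LDA treats the occurrences as distinct terms.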

As Sharmila noted, Latent Dirichlet allocation (LDA) is considered the state of the art for document comparison, and is superior to PLSA, which tends to overfit the training data. In addition, there are many more tools to support LDA and analyze whether the results you get with LDA are meaningful. (If you're feeling adventurous, you can read David Mimno's two papers from EMNLP 2011 on how to assess the quality of the latent topics you get from LDA.)
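For instance, gensim's CoherenceModel implements the UMass coherence score from Mimno's EMNLP 2011 work, so you can sanity-check a trained model's topics. This sketch reuses the toy lda, corpus, and dictionary variables from the first sketch above.

    # Minimal sketch: scoring topic quality with the UMass coherence measure,
    # as implemented in gensim. Assumes the lda/corpus/dictionary objects
    # from the earlier illustrative example.
    from gensim.models import CoherenceModel

    cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                        coherence='u_mass')
    print(cm.get_coherence())  # closer to zero = more coherent topics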
