
Probabilistic latent semantic analysis/Indexing - Introduction

But recently I found this link quite helpful for understanding the principles of LSA without too much math: http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html. It forms a good basis on which I can build further.

Currently, I'm looking for a similar introduction to Probabilistic Latent Semantic Analysis/Indexing: less math and more examples explaining the principles behind it. If you know of such an introduction, please let me know.

Can it be used to find the measure of similarity between sentences? Does it handle polysemy?

Is there a Python implementation?

Thank you.


There is a good talk by Thomas Hofmann that explains both LSA and its connections to Probabilistic Latent Semantic Analysis (PLSA). The talk has some math, but is much easier to follow than the PLSA paper (or even its Wikipedia page).

PLSA can be used to get a similarity measure between sentences, since two sentences can be viewed as short documents drawn from a probability distribution over latent classes. The similarity will depend heavily on your training set, though: the documents you use to train the latent class model should reflect the types of documents you want to compare. Generating a PLSA model from just two sentences won't create meaningful latent classes, and training on a corpus of very similar contexts may create latent classes that are overly sensitive to slight changes in the documents. Moreover, because sentences contain relatively few tokens (compared to documents), I don't believe you'll get high-quality similarity results from PLSA at the sentence level.
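If it helps to see the idea in code, here's a minimal sketch of sentence similarity via inferred topic distributions. Since I'm not aware of a standard, maintained PLSA package in Python, it uses gensim's LdaModel as a stand-in (LDA is the closely related model discussed below); the training corpus, topic count, and test sentences are purely illustrative.

    # Minimal sketch: sentence similarity via topic distributions.
    # gensim's LdaModel stands in for PLSA here; all data and parameters
    # below are toy values, not a recommendation.
    from gensim import corpora, models

    # Train on documents that reflect the kinds of text you want to compare.
    training_docs = [
        "the cat sat on the mat".split(),
        "dogs and cats make good pets".split(),
        "the stock market fell sharply today".split(),
        "investors worry about the stock market".split(),
    ]
    dictionary = corpora.Dictionary(training_docs)
    corpus = [dictionary.doc2bow(doc) for doc in training_docs]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)

    def topic_vector(sentence, num_topics=2):
        """Infer a dense topic distribution for a single sentence."""
        bow = dictionary.doc2bow(sentence.lower().split())
        dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
        return [dist.get(t, 0.0) for t in range(num_topics)]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5
        return dot / norm if norm else 0.0

    print(cosine(topic_vector("my cat chased the dog"),
                 topic_vector("the market worried investors")))

With a real training corpus you would compare the inferred distributions the same way; cosine is used here for simplicity, but distances designed for probability distributions (e.g. Hellinger) are also common.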

PLSA does not handle polysemy. However, if you are concerned with polysemy, you might try running a Word Sense Disambiguation tool over your input text to tag each word with its correct sense. Running PLSA (or LDA) over this tagged corpus will remove the effects of polysemy in the resulting document representations.
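As a rough illustration of the tagging step, here is a sketch using NLTK's simplified Lesk algorithm as the WSD tool. Lesk is only a weak baseline, so treat this as a starting point rather than a recommendation; it needs the NLTK data packages noted in the comments.

    # Minimal sketch: sense-tag tokens before topic modeling, using NLTK's
    # simplified Lesk WSD. Requires: pip install nltk, plus
    # nltk.download('punkt') and nltk.download('wordnet').
    from nltk.tokenize import word_tokenize
    from nltk.wsd import lesk

    def sense_tag(sentence):
        """Replace each token with a WordNet synset name where Lesk finds one."""
        tokens = word_tokenize(sentence.lower())
        tagged = []
        for tok in tokens:
            synset = lesk(tokens, tok)  # None for words WordNet doesn't cover
            tagged.append(synset.name() if synset else tok)
        return tagged

    print(sense_tag("I sat on the river bank"))
    print(sense_tag("I deposited money at the bank"))
    # If Lesk distinguishes the two senses, "bank" maps to different synset
    # names, so PLSA/LDA treats the occurrences as distinct terms.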

As Sharmila noted, Latent Dirichlet allocation (LDA) is considered the state of the art for document comparison, and is superior to PLSA, which tends to overfit the training data. In addition, there are many more tools to support LDA and analyze whether the results you get with LDA are meaningful. (If you're feeling adventurous, you can read David Mimno's two papers from EMNLP 2011 on how to assess the quality of the latent topics you get from LDA.)
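For instance, gensim's CoherenceModel implements the UMass coherence score from Mimno's EMNLP 2011 work, so you can sanity-check a trained model's topics. This sketch reuses the toy lda, corpus, and dictionary variables from the first sketch above.

    # Minimal sketch: scoring topic quality with the UMass coherence measure,
    # as implemented in gensim. Assumes the lda/corpus/dictionary objects
    # from the earlier illustrative example.
    from gensim.models import CoherenceModel

    cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                        coherence='u_mass')
    print(cm.get_coherence())  # closer to zero = more coherent topics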
