
Calculating a probability distribution

I have a simple (maybe stupid) question. I want to calculate the Kullback–Leibler divergence between two documents, which requires a probability distribution for each document.

I do not know how to calculate the probability distribution for each document. Any simple answer with a layman's example would be much appreciated.

Let's say we have the following two documents:

1 - cross validated answers are good 
2 - simply validated answers are nice

(the wording of the documents is just filler text to give you an example)

How do we calculate probabilities for these documents?

Let's say we add one more document:

3 - simply cross is not good answer

If we add another document, how would it impact the probability distributions?

Thanks


If you add a document to a collection of documents, then unless that document is exactly the same as the existing collection, the distribution of words or terms in your collection is going to change to accommodate the newly added words. The question arises: "Is that really what you want to do with the third document?"

Kullback-Leibler divergence is a measure of divergence between two distributions. What are your two distributions?

If your distribution is the probability of a certain word being selected at random from a document, then the space over which you have probability values is the set of words which make up your documents. For your first two documents (I assume this is your entire collection), you can build a word space of 7 terms. Treating the documents as bags of words, the probabilities of each word being selected at random are:

             doc 1    doc 2    doc 3    doc 3 (lem)
answers       0.2      0.2      0.0        0.2
are           0.2      0.2      0.0        0.2
cross         0.2      0.0      0.33       0.2
good          0.2      0.0      0.33       0.2
nice          0.0      0.2      0.0        0.0
simply        0.0      0.2      0.33       0.2
validated     0.2      0.2      0.0        0.0

[This is calculated as each term's frequency divided by the document length. Notice that the new document contains word forms that don't match the words in doc 1 and doc 2, so only its in-vocabulary words get probability mass. The doc 3 (lem) column shows what the doc 3 probabilities would be if you stemmed or lemmatized the pairs (are/is) and (answer/answers) to the same terms.]
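To make the arithmetic concrete, here is a minimal Python sketch of the bag-of-words calculation above. The whitespace tokenization and the optional vocabulary filter are my own assumptions for illustration, not something prescribed by the question:

```python
from collections import Counter

def word_probabilities(doc, vocabulary=None):
    """Relative frequency of each word in a document (bag of words).

    If a vocabulary is given, only in-vocabulary words are counted,
    which is how the doc 3 column above arrives at 1/3 per term.
    """
    tokens = doc.lower().split()
    if vocabulary is not None:
        tokens = [t for t in tokens if t in vocabulary]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

doc1 = "cross validated answers are good"
doc2 = "simply validated answers are nice"
vocab = set(doc1.split()) | set(doc2.split())   # the 7-term word space

print(word_probabilities(doc1))   # every word of doc 1 -> 0.2
print(word_probabilities("simply cross is not good answer", vocab))
# {'simply': 0.333..., 'cross': 0.333..., 'good': 0.333...}
```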

As for introducing the third document: a typical thing you might want to do with the Kullback-Leibler divergence is compare a new document or collection of documents against already-known documents or collections.

Computing the Kullback-Leibler divergence D(P||Q) produces a value signifying how well the true distribution P is captured by the substitute distribution Q. So Q1 could be the distribution of words in doc 1, and Q2 the distribution of words in doc 2. Computing the KL divergence with P as the distribution of words in the new document (doc 3), you get measures of how divergent the new document is from doc 1 and from doc 2. Using this information, you can say how similar the new document is to your known documents/collections.
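For concreteness, the divergence itself is D(P||Q) = Σ_x P(x) log(P(x) / Q(x)). One practical wrinkle the table makes visible: D(P||Q) is infinite whenever some word has P > 0 but Q = 0, so a real implementation has to smooth Q. Here is a minimal sketch; the epsilon-smoothing and the function name are my own illustration, not part of the original answer:

```python
import math

def kl_divergence(p, q, vocabulary, epsilon=1e-9):
    """D(P||Q) over a shared vocabulary, in nats.

    epsilon-smoothing is an assumption of this sketch: without it,
    any word with p > 0 but q = 0 (e.g. 'cross' in doc 2) makes the
    divergence infinite.
    """
    total_q = sum(q.get(w, 0.0) + epsilon for w in vocabulary)
    divergence = 0.0
    for w in vocabulary:
        pw = p.get(w, 0.0)
        if pw == 0.0:
            continue  # the limit of p*log(p/q) as p -> 0 is 0
        qw = (q.get(w, 0.0) + epsilon) / total_q  # smoothed, renormalized
        divergence += pw * math.log(pw / qw)
    return divergence

vocab = {"answers", "are", "cross", "good", "nice", "simply", "validated"}
p_doc3 = {"cross": 1/3, "good": 1/3, "simply": 1/3}
q_doc1 = {w: 0.2 for w in ["cross", "validated", "answers", "are", "good"]}
q_doc2 = {w: 0.2 for w in ["simply", "validated", "answers", "are", "nice"]}

# Smaller divergence = the known document is a better model of doc 3.
print(kl_divergence(p_doc3, q_doc1, vocab))
print(kl_divergence(p_doc3, q_doc2, vocab))
```

With distributions this sparse, the result is dominated by the smoothing constant on the zero-probability words, which is one practical reason to lemmatize first (the doc 3 (lem) column) before comparing.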

