calculating probability distribution
I have a simple (maybe stupid) question. I want to calculate the Kullback–Leibler divergence between two documents. It requires a probability distribution for each document.
I do not know how to calculate the probabilities for each document. Any simple answer with a layman's example would be much appreciated.
Let's say we have the following two documents:
1 - cross validated answers are good
2 - simply validated answers are nice
(wording of the documents is just bla bla to give you an example)
How do we calculate probabilities for these documents?
Let's say we add one more document:
3 - simply cross is not good answer
If we add another document, how would it impact the probability distribution?
Thanks
If you add a document to a collection of documents, then unless that document has exactly the same word distribution as the collection, the distribution of words or terms in your collection is going to change to accommodate the newly added words. The question arises: "Is that really what you want to do with the third document?"
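To make that effect concrete, here is a minimal Python sketch that pools the documents into one bag of words and compares the collection's word distribution before and after adding the third document. The helper name `collection_distribution` and the plain whitespace tokenization are assumptions made for illustration only.

```python
from collections import Counter

def collection_distribution(docs):
    """Pool all documents into one bag of words and return word -> probability."""
    counts = Counter(word for doc in docs for word in doc.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

docs = [
    "cross validated answers are good",
    "simply validated answers are nice",
]
before = collection_distribution(docs)
after = collection_distribution(docs + ["simply cross is not good answer"])

# "validated" occurs twice among 10 words before, and twice among 16 words after.
print(before["validated"], after["validated"])
# Words that occur only in the new document enter the distribution.
print(sorted(set(after) - set(before)))
```

Every existing word's probability shifts because the total word count grows, and the new words take up part of the probability mass.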
Kullback-Leibler divergence is a measure of divergence between two distributions. What are your two distributions?
If your distribution is the probability of a certain word being selected at random from a document, then the space over which you have probability values is the collection of words that make up your documents. For your first two documents (I assume this is your entire collection), you can build a word-space of 7 terms. Treating the documents as bags of words, the probabilities of a word being selected at random are:
             doc 1   doc 2   doc 3   doc 3 (lem)
answers      0.2     0.2     0.0     0.2
are          0.2     0.2     0.0     0.2
cross        0.2     0.0     0.33    0.2
good         0.2     0.0     0.33    0.2
nice         0.0     0.2     0.0     0.0
simply       0.0     0.2     0.33    0.2
validated    0.2     0.2     0.0     0.0
[Each value is the term frequency divided by the document length, counting only words that belong to the 7-term word-space; that is why the three matching words in doc 3 each get 1/3 (0.33). Notice that the new document has word forms that aren't the same as the words in doc 1 and doc 2. The (lem) column gives the probabilities if you stemmed or lemmatized the pairs (are/is) and (answer/answers) to the same term.]
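The table can be reproduced with a short Python sketch. It assumes whitespace tokenization and drops any word outside the 7-term word-space; the function name `word_distribution` and the tiny `lemmas` mapping are hypothetical stand-ins (a real stemmer or lemmatizer would replace the mapping).

```python
from collections import Counter

def word_distribution(text, vocabulary):
    """Relative frequency of each vocabulary word, ignoring out-of-vocabulary words."""
    counts = Counter(w for w in text.split() if w in vocabulary)
    total = sum(counts.values())
    return {w: (counts[w] / total if total else 0.0) for w in sorted(vocabulary)}

doc1 = "cross validated answers are good"
doc2 = "simply validated answers are nice"
doc3 = "simply cross is not good answer"

# Word-space built from the first two documents only (7 terms).
vocabulary = set(doc1.split()) | set(doc2.split())

# Hypothetical stand-in for a real lemmatizer: maps the two word pairs
# mentioned above onto a shared form.
lemmas = {"is": "are", "answer": "answers"}
doc3_lem = " ".join(lemmas.get(w, w) for w in doc3.split())

for name, text in [("doc 1", doc1), ("doc 2", doc2),
                   ("doc 3", doc3), ("doc 3 (lem)", doc3_lem)]:
    print(name, word_distribution(text, vocabulary))
```

Ignoring out-of-vocabulary words is one design choice among several; you could instead extend the word-space with the new words, which would change every column.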
Bringing the third document into the scenario: a typical use of Kullback-Leibler divergence is to compare a new document or collection of documents against already-known documents or collections of documents.
Computing the Kullback-Leibler divergence D(P||Q) produces a value signifying how well the true distribution P is captured by the substitute distribution Q. So Q1 could be the distribution of words in doc 1, and Q2 could be the distribution of words in doc 2. With P being the distribution of words in the new document (doc 3), computing D(P||Q1) and D(P||Q2) gives you measures of how divergent the new document is from doc 1 and how divergent it is from doc 2. Using this information, you can say how similar the new document is to your known documents/collections.
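As a rough sketch of that comparison in Python: D(P||Q) is the sum over words w of P(w) * log(P(w) / Q(w)). One caveat is that the divergence is infinite whenever Q(w) = 0 for a word with P(w) > 0, which happens with the raw probabilities in the table. The additive smoothing below (the `alpha` parameter and its value 0.01) is an assumption added here to avoid that, not something prescribed above, and the function names are illustrative.

```python
import math
from collections import Counter

def smoothed_distribution(text, vocabulary, alpha=0.01):
    """Word probabilities over the vocabulary with additive smoothing.

    alpha is an assumed smoothing constant that keeps every probability > 0.
    """
    counts = Counter(w for w in text.split() if w in vocabulary)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

def kl_divergence(p, q):
    """D(P||Q) = sum over w of P(w) * log(P(w) / Q(w)), in nats."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

doc1 = "cross validated answers are good"
doc2 = "simply validated answers are nice"
doc3 = "simply cross is not good answer"
vocabulary = set(doc1.split()) | set(doc2.split())

q1 = smoothed_distribution(doc1, vocabulary)  # known document 1
q2 = smoothed_distribution(doc2, vocabulary)  # known document 2
p = smoothed_distribution(doc3, vocabulary)   # new document

print("D(P||Q1) =", kl_divergence(p, q1))
print("D(P||Q2) =", kl_divergence(p, q2))
```

With these toy documents and no lemmatization, doc 3 comes out less divergent from doc 1 than from doc 2, because it shares two in-vocabulary words with doc 1 (cross, good) and only one with doc 2 (simply). Note that the measure is not symmetric: D(P||Q) generally differs from D(Q||P). The exact numbers depend on the chosen `alpha`.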