Calculate entropy of probability distributions of two data sets - text analysis & sentiment in C#
I'm using a 1.6M tweet corpus to train a Naive Bayes sentiment engine.
I have two Dictionaries of n-grams (Dictionary<string,int>, where the string is my n-gram and the int is the number of occurrences of the n-gram in my corpus). The first dictionary is built from the positive tweets, the second from the negative tweets. In an article on this subject, the authors discard common n-grams, i.e. n-grams that neither strongly indicate sentiment nor indicate objectivity of a sentence; such n-grams appear evenly across all data sets. I understand this quite well conceptually, but the formula they provide is rooted in mathematics, not code, and I'm not able to decipher what I'm supposed to be doing.
I have spent the last few hours searching the web for how to do this. I've found examples of entropy calculation for search engines, which usually calculate the entropy of a string; the most common code block I've come across is ShannonsEntropy.
I'm also relatively new to this space, so I'm sure my ignorance is playing a bit of a part in this, but I'm hoping somebody on SO can help nudge me in the right direction. To summarize:
Given two Dictionaries, PosDictionary & NegDictionary, how do I calculate the entropy of the n-grams that appear in both?
Pseudo-code is fine, and I imagine it looks something like this:
foreach (string myNGram in PosDictionary.Keys.ToList()) {  // snapshot the keys so we can remove entries while iterating
    if (NegDictionary.ContainsKey(myNGram)) {
        double result = CalculateEntropyOfNGram(myNGram);
        if (result > someTheta) {  // e.g. someTheta = 0.80
            PosDictionary.Remove(myNGram);
            NegDictionary.Remove(myNGram);
        }
    }
}
I think that's the process I'll need to take. What I don't know is what the CalculateEntropyOfNGram function looks like...
(Edit) Here is the link to the PDF that describes the entropy/salience process (section 5.3).
Equation (10) in the paper gives the definition. If you have problems reading the equation, it is a short notation for
H(S|g) = -p(S1|g) * log(p(S1|g)) - p(S2|g) * log(p(S2|g)) - ...
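In C#, a minimal sketch of CalculateEntropyOfNGram could look like the following. Note the assumptions: p(Si|g) is estimated simply as the n-gram's share of occurrences across the two corpora (the paper's equation (10) may normalize differently, e.g. by corpus size), and the function takes the two dictionaries as parameters rather than reading them from fields:

using System;
using System.Collections.Generic;

static class NGramEntropy {
    // Entropy of the sentiment distribution for a single n-gram.
    // With log base 2 and two classes the result lies in [0, 1]:
    // 1.0 means the n-gram occurs equally often in both corpora,
    // so it carries no sentiment signal and can be discarded.
    public static double CalculateEntropyOfNGram(
            string ngram,
            Dictionary<string, int> posDictionary,
            Dictionary<string, int> negDictionary) {
        double posCount = posDictionary.TryGetValue(ngram, out int p) ? p : 0;
        double negCount = negDictionary.TryGetValue(ngram, out int n) ? n : 0;
        double total = posCount + negCount;
        if (total == 0) return 0.0;  // unseen n-gram: no distribution to measure

        double entropy = 0.0;
        foreach (double count in new[] { posCount, negCount }) {
            if (count == 0) continue;            // x * log(x) -> 0 as x -> 0
            double prob = count / total;         // estimate of p(Si | g)
            entropy -= prob * Math.Log(prob, 2);
        }
        return entropy;
    }
}

Under these assumptions, an n-gram with counts 50/50 gives entropy 1.0 (appears evenly, so discard), while 90/10 gives about 0.47 (strongly skewed, so keep), which is consistent with a threshold like the 0.80 in your pseudo-code.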