Log likelihood to implement Naive Bayes for Text Classification
I am implementing the Naive Bayes algorithm for text classification. I have ~1000 documents for training and 400 documents for testing. I think I've implemented the training part correctly, but I am confused about the testing part. Here is briefly what I've done:
In my training function:
vocabularySize= GetUniqueTermsInCollection();//get all unique terms in the entire collection
spamModelArray[vocabularySize];
nonspamModelArray[vocabularySize];
for each training_file{
class = GetClassLabel(); // 0 = spam, 1 = non-spam
document = GetDocumentID();
counterTotalTrainingDocs++;
if(class == 0){
counterTotalSpamTrainingDocs++;
}
for each term in document{
freq = GetTermFrequency(); // how many times this term appears in this document?
id = GetTermID(); // unique id of the term
if(class == 0){ //SPAM
spamModelArray[id]+= freq;
totalNumberofSpamWords+= freq; // total number of term occurrences in the spam training docs
}else{ // NON-SPAM
nonspamModelArray[id]+= freq;
totalNumberofNonSpamWords+= freq; // total number of term occurrences in the non-spam training docs
}
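}//for each term
}//for each training_file
// normalize only after all training files have been counted
for i in vocabularySize{
// add-one (Laplace) smoothing, so a term unseen in one class does not give log(0) in testing
spamModelArray[i] = (spamModelArray[i] + 1)/(totalNumberofSpamWords + vocabularySize);
nonspamModelArray[i] = (nonspamModelArray[i] + 1)/(totalNumberofNonSpamWords + vocabularySize);
}//for
priorProb = counterTotalSpamTrainingDocs/counterTotalTrainingDocs;// prior probability of the spam class (use floating-point division)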
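For reference, here is a minimal Python sketch of the same training procedure. It is only an illustration under some assumptions: documents are given as lists of tokens instead of term IDs, the function name `train_naive_bayes` is made up for this example, and it uses add-one smoothing so that no term ends up with probability zero.

```python
from collections import Counter


def train_naive_bayes(docs, labels):
    """Estimate per-class term probabilities and the spam prior.

    docs   -- list of token lists (assumed input format for this sketch)
    labels -- list of class labels, 0 = spam, 1 = non-spam
    """
    vocabulary = {term for doc in docs for term in doc}
    counts = {0: Counter(), 1: Counter()}
    for doc, label in zip(docs, labels):
        counts[label].update(doc)  # adds every occurrence, i.e. the term frequency

    models = {}
    for label in (0, 1):
        total = sum(counts[label].values())  # total term occurrences in this class
        # Add-one smoothing: an assumption of this sketch, it keeps log() finite later.
        models[label] = {
            term: (counts[label][term] + 1) / (total + len(vocabulary))
            for term in vocabulary
        }

    prior_spam = labels.count(0) / len(labels)
    return models, prior_spam
```

With this layout each class model sums to 1 over the vocabulary, which is a quick sanity check on the counting.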
I think I understood and implemented the training part correctly, but I am not sure I implemented the testing part properly. Here, I go through each test document and calculate logP(spam|d) and logP(non-spam|d) for it. Then I compare these two quantities to determine the class (spam/non-spam).
In my testing function:
vocabularySize= GetUniqueTermsInCollection();//get all unique terms in the entire collection
for each testing_file{
document = GetDocumentID();
logProbabilityofSpam = 0;
logProbabilityofNonSpam = 0;
for each term in document{
freq = GetTermFrequency(); // how many times this term appears in this document?
id = GetTermID(); // unique id of the term
// logP(w1 w2 ... wn | c) = sum over j of C(wj) * logP(wj | c)
logProbabilityofSpam+= freq*log(spamModelArray[id]);
logProbabilityofNonSpam+= freq*log(nonspamModelArray[id]);
}//for
// Now I am deciding the class of this document by comparing the two log scores
if (logProbabilityofNonSpam + log(1-priorProb) > logProbabilityofSpam +log(priorProb)) { // argmax over k of [logP(d|ck) + logP(ck)]
newclass = 1; //not spam
}else{
newclass = 0; // spam
}
}//for
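The same scoring step as a small Python sketch, for illustration only. It assumes the models are dictionaries mapping each term to its class-conditional probability (`models[0]` for spam, `models[1]` for non-spam) and that all probabilities are strictly positive. The names `log_scores` and `classify` are hypothetical, and skipping out-of-vocabulary terms is just one possible choice.

```python
import math
from collections import Counter


def log_scores(tokens, models, prior_spam):
    """Return (log P(d|spam) + log P(spam), log P(d|non-spam) + log P(non-spam))."""
    log_spam = math.log(prior_spam)
    log_nonspam = math.log(1.0 - prior_spam)
    for term, freq in Counter(tokens).items():
        if term not in models[0]:
            continue  # term never seen in training: ignored in this sketch
        log_spam += freq * math.log(models[0][term])
        log_nonspam += freq * math.log(models[1][term])
    return log_spam, log_nonspam


def classify(tokens, models, prior_spam):
    """Return 1 (non-spam) if its score is strictly larger, else 0 (spam)."""
    log_spam, log_nonspam = log_scores(tokens, models, prior_spam)
    return 1 if log_nonspam > log_spam else 0
```

Note that the two returned values are unnormalized log scores: they are enough to pick the larger class, but they are not yet probabilities.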
My problem is that I want to return the probability of each class instead of hard 1's and 0's (spam/non-spam). I want to see, e.g., newclass = 0.8684212, so I can apply a threshold later on. But I am confused here. How can I calculate the probability for each document? Can I use the log probabilities to calculate it?
The probability that data described by a set of features {F1, F2, ..., Fn} belongs to class C, according to the naïve Bayes probability model, is
P(C|F) = P(C) * (P(F1|C) * P(F2|C) * ... * P(Fn|C)) / P(F1, ..., Fn)
You already have all the terms (in logarithmic form) except for the 1 / P(F1, ..., Fn) term. That term is the same for every class, so the classifier you're implementing (strictly, the MAP classifier) does not need it to pick the most probable class.
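Taking logarithms of the expression above makes the correspondence with your code explicit: the first two terms are what your test loop accumulates, and the last term is the missing normalizer.

```latex
\log P(C \mid F) = \log P(C) + \sum_{i=1}^{n} \log P(F_i \mid C) - \log P(F_1, \ldots, F_n)
```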
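You'd have to collect frequencies of the features over the whole collection as well, and from them calculate
P(F1, ..., Fn) = P(F1) * ... * P(Fn)
Note that this factorization treats the features as independent regardless of class, which is a further approximation, so the resulting values are not guaranteed to sum to 1 over the classes.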
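Since there are only two classes here, a common alternative is to get the normalizer by summing the joint score over both classes, P(F1, ..., Fn) = P(spam) * P(F|spam) + P(non-spam) * P(F|non-spam), which makes the two posteriors sum to 1 by construction. Below is a minimal Python sketch of that idea. The function name `posterior_spam` is made up for this example, and its inputs are assumed to be the finite log scores log P(d|c) + log P(c) from your test loop. Subtracting the larger score before exponentiating (the log-sum-exp trick) avoids underflow for long documents.

```python
import math


def posterior_spam(log_spam, log_nonspam):
    """Turn two unnormalized log scores into P(spam | d).

    log_spam    -- log P(d|spam) + log P(spam)
    log_nonspam -- log P(d|non-spam) + log P(non-spam)
    Both are assumed finite (e.g. thanks to smoothing).
    """
    m = max(log_spam, log_nonspam)          # shift so the largest exponent is 0
    exp_spam = math.exp(log_spam - m)
    exp_nonspam = math.exp(log_nonspam - m)
    return exp_spam / (exp_spam + exp_nonspam)
```

The returned value can then be compared against any threshold you like; P(non-spam | d) is simply one minus it.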