Log likelihood to implement Naive Bayes for Text Classification

2023-02-20 15:21

I am implementing the Naive Bayes algorithm for text classification. I have ~1000 documents for training and 400 documents for testing. I think I've implemented the training part correctly, but I am confused about the testing part. Briefly, here is what I've done:

In my training function:

vocabularySize = GetUniqueTermsInCollection(); // number of unique terms in the entire collection

spamModelArray[vocabularySize];    // per-term counts, later P(term|spam)
nonspamModelArray[vocabularySize]; // per-term counts, later P(term|non-spam)

for each training_file{
        class = GetClassLabel(); // 0 = spam, 1 = non-spam
        document = GetDocumentID();

        counterTotalTrainingDocs++;

        if(class == 0){
                counterTotalSpamTrainingDocs++;
        }

        for each term in document{
                freq = GetTermFrequency(); // how many times this term appears in this document
                id = GetTermID(); // unique id of the term

                if(class == 0){ // SPAM (comparison, not assignment)
                        spamModelArray[id] += freq;
                        totalNumberofSpamWords += freq; // total number of term occurrences in the spam training docs
                }else{ // NON-SPAM
                        nonspamModelArray[id] += freq;
                        totalNumberofNonSpamWords += freq; // total number of term occurrences in the non-spam training docs
                }
        }//for
}//for

// normalize once, after all training files have been counted
// note: a term never seen in a class gets probability 0 here, so log() of it is -infinity at test time unless the counts are smoothed
for i in vocabularySize{
        spamModelArray[i] = spamModelArray[i]/totalNumberofSpamWords;
        nonspamModelArray[i] = nonspamModelArray[i]/totalNumberofNonSpamWords;
}//for

priorProb = counterTotalSpamTrainingDocs/counterTotalTrainingDocs; // prior probability of the spam documents
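
For comparison, here is a minimal Python sketch of the same training step. It is not the asker's code: the function name `train`, the input format (a list of `(label, {term_id: freq})` pairs) and the add-one smoothing parameter `alpha` are assumptions made for illustration. It stores log probabilities directly and normalizes only once, after all documents have been counted.

```python
import math

def train(docs, vocab_size, alpha=1.0):
    """Train a two-class multinomial Naive Bayes model.

    docs: list of (label, term_freqs) pairs, where label is 0 (spam) or
          1 (non-spam) and term_freqs maps term id -> count in that document.
    alpha: additive smoothing constant (an assumption, not in the original).
    Returns (log_probs, prior_spam), with log_probs[k][i] = log P(term i | class k).
    """
    counts = [[0] * vocab_size, [0] * vocab_size]
    totals = [0, 0]   # total term occurrences per class
    n_docs = [0, 0]   # number of training documents per class

    for label, term_freqs in docs:
        n_docs[label] += 1
        for term_id, freq in term_freqs.items():
            counts[label][term_id] += freq
            totals[label] += freq

    # Normalize once, after all documents are counted. Smoothing keeps every
    # probability strictly positive, so log() never sees zero.
    log_probs = [
        [math.log((c + alpha) / (totals[k] + alpha * vocab_size)) for c in counts[k]]
        for k in (0, 1)
    ]
    prior_spam = n_docs[0] / sum(n_docs)
    return log_probs, prior_spam
```

For example, `train([(0, {0: 2, 1: 1}), (1, {1: 1, 2: 3})], vocab_size=3)` returns per-class log probabilities whose exponentials sum to 1 within each class.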

I think I understood and implemented the training part correctly, but I am not sure I implemented the testing part properly. Here, I go through each test document and calculate logP(spam|d) and logP(non-spam|d) for it. Then I compare these two quantities to determine the class (spam/non-spam).

In my testing function:

vocabularySize= GetUniqueTermsInCollection;//get all unique terms in the entire collection
for each testing_file:
        document = getDocumentID;

        logProbabilityofSpam = 0;
        logProbabilityofNonSpam = 0;

        for each term in document{
                freq = GetTermFrequency; // how many times this term appears in this document?
                id = GetTermID; // unique id of the term 

                // logP(w1 w2 ... wn | c) = sum over j of C(wj) * logP(wj | c)
                logProbabilityofSpam+= freq*log(spamModelArray[id]);
                logProbabilityofNonSpam+= freq*log(nonspamModelArray[id]);
        }//for

        // Now I am calculating the probability of being spam for this document
        if (logProbabilityofNonSpam + log(1-priorProb) > logProbabilityofSpam +log(priorProb)) { // argmax[logP(i|ck) + logP(ck)]
                newclass = 1; //not spam
        }else{
                newclass = 0; // spam
        }

}//for
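
The testing loop above can be sketched in Python as follows. The names `log_scores` and `classify` are hypothetical, and the sketch assumes the model is available as `log_probs[k][term_id]` (log P(term | class k), with 0 = spam and 1 = non-spam), which is a different storage format from the arrays in the question.

```python
import math

def log_scores(term_freqs, log_probs, prior_spam):
    """Return (log P(spam) + log P(d|spam), log P(non-spam) + log P(d|non-spam)).

    term_freqs: dict mapping term id -> count in the test document.
    log_probs:  log_probs[k][i] = log P(term i | class k), 0 = spam, 1 = non-spam.
    """
    spam = math.log(prior_spam)
    nonspam = math.log(1.0 - prior_spam)
    for term_id, freq in term_freqs.items():
        # Each occurrence of a term contributes one factor P(term | class),
        # so in log space the contribution is freq * log P(term | class).
        spam += freq * log_probs[0][term_id]
        nonspam += freq * log_probs[1][term_id]
    return spam, nonspam

def classify(term_freqs, log_probs, prior_spam):
    """MAP decision: 0 for spam, 1 for non-spam, as in the question."""
    spam, nonspam = log_scores(term_freqs, log_probs, prior_spam)
    return 1 if nonspam > spam else 0
```

Keeping the two scores separate from the decision makes it possible to reuse them later when a probability, rather than a hard label, is needed.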

My problem is that I want to return the probability of each class instead of a hard 1 or 0 (spam/non-spam). I want to see, e.g., newclass = 0.8684212, so that I can apply a threshold later on. But I am confused here. How can I calculate the probability for each document? Can I use the log probabilities to calculate it?

The probability of the data described by a set of features {F1, F2, ..., Fn} belonging to class C, according to the naïve Bayes probability model, is

P(C|F) = P(C) * (P(F1|C) * P(F2|C) * ... * P(Fn|C)) / P(F1, ..., Fn)

You have all the terms (in logarithmic form), except for the 1 / P( F1, ..., Fn) term since that's not used in the naïve Bayes classifier that you're implementing. (Strictly, the MAP classifier.)

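You'd have to collect frequencies of the features as well and, assuming the features are also independent when the class is not given, calculate from them

P(F1, ..., Fn) = P(F1) * ... * P(Fn)

Alternatively, because the posteriors of all classes must sum to 1, the same term can be obtained by summing P(C) * P(F1|C) * ... * P(Fn|C) over the classes, which needs no extra counts.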
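
One way to turn the two log scores from the testing loop into a probability, sketched here under assumptions rather than taken from the answer: compute P(F1, ..., Fn) as the sum of P(C) * P(F1|C) * ... * P(Fn|C) over both classes, and do the division in log space so that very small products do not underflow. The function name `posterior_spam` is hypothetical, and both inputs are assumed to be finite (for example, because the term probabilities were smoothed).

```python
import math

def posterior_spam(log_joint_spam, log_joint_nonspam):
    """Return P(spam | d) from the two unnormalized log scores.

    log_joint_spam    = log P(spam)     + sum of freq * log P(term | spam)
    log_joint_nonspam = log P(non-spam) + sum of freq * log P(term | non-spam)
    """
    # Subtract the larger score before exponentiating (log-sum-exp trick),
    # so at least one exponent is exactly 0 and nothing underflows to 0/0.
    m = max(log_joint_spam, log_joint_nonspam)
    log_evidence = m + math.log(
        math.exp(log_joint_spam - m) + math.exp(log_joint_nonspam - m)
    )
    return math.exp(log_joint_spam - log_evidence)
```

The result lies between 0 and 1, and P(non-spam | d) is one minus it, so a threshold other than 0.5 can be applied afterwards. Exponentiating the raw scores directly would typically give 0.0 for both classes on long documents, which is why the subtraction of the maximum matters.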

Tags: bayesian data-mining machine-learning probability text-mining
