
Using document length in the Naive Bayes Classifier of NLTK Python

I am building a spam filter using NLTK in Python. I currently check for the occurrences of words and use the NaiveBayesClassifier, which gives an accuracy of 0.98 and an F-measure of 0.92 for spam and 0.98 for non-spam. However, when I inspect the documents my program misclassifies, I notice that a lot of the spam that is classified as non-spam consists of very short messages.

So I want to use the length of a document as a feature for the NaiveBayesClassifier. The problem is that it only handles binary values. Is there any way to do this other than, for example, defining a feature like length<100 = true/false?

(P.S. I built the spam detector analogously to the example at http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html)


NLTK's implementation of Naive Bayes doesn't do that, but you can combine NaiveBayesClassifier's predictions with a distribution over document lengths. NLTK's prob_classify method gives you a conditional probability distribution over classes given the words in the document, i.e., P(cl|doc). What you want is P(cl|doc,len): the probability of a class given the words in the document and its length. If we make a few more independence assumptions (namely, that a document's words and its length are independent, both unconditionally and given the class), we get:

P(cl|doc,len) = P(doc,len|cl) * P(cl) / P(doc,len)
              = P(doc|cl) * P(len|cl) * P(cl) / (P(doc) * P(len))
              = (P(doc|cl) * P(cl) / P(doc)) * (P(len|cl) / P(len))
              = P(cl|doc) * P(len|cl) / P(len)

You've already got the first term from prob_classify, so all that's left to do is to estimate P(len|cl) and P(len).

You can get as fancy as you want when it comes to modeling document lengths, but to get started you can just assume that the logs of the document lengths are normally distributed. If you know the mean and the standard deviation of the log document lengths in each class and overall, it's then easy to calculate P(len|cl) and P(len).

Here's one way of going about estimating P(len):

from math import log

import numpy as np
from nltk.corpus import movie_reviews
from scipy import stats

# Log-transformed length of every document in the corpus.
loglens = [log(len(movie_reviews.words(f))) for f in movie_reviews.fileids()]

# Fit a normal distribution to the log-lengths.
mu = np.mean(loglens)
sd = np.std(loglens)
p = stats.norm(mu, sd)

The only tricky things to remember are that this is a distribution over log-lengths rather than lengths and that it's a continuous distribution. So, the probability of a document of length L will be:

p.cdf(log(L+1)) - p.cdf(log(L))

The conditional length distributions can be estimated in the same way, using the log-lengths of the documents in each class. That should give you what you need for P(cl|doc,len).
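To make this concrete, here is a minimal sketch of the whole pipeline. It assumes a trained NaiveBayesClassifier named classifier, a feature extractor document_features, and labeled training documents train_docs as (word list, label) pairs with labels 'spam' and 'ham'; those names are placeholders for your own code, not part of NLTK:

from math import log

import numpy as np
from scipy import stats

def fit_log_normal(lengths):
    # Fit a normal distribution to the logs of a list of document lengths.
    loglens = [log(n) for n in lengths]
    return stats.norm(np.mean(loglens), np.std(loglens))

# P(len|cl): one fit per class; P(len): one fit over all documents.
class_dists = {cl: fit_log_normal([len(words) for words, label in train_docs
                                   if label == cl])
               for cl in ('spam', 'ham')}
overall_dist = fit_log_normal([len(words) for words, _ in train_docs])

def p_length(dist, n):
    # Probability the continuous log-length distribution assigns to length n.
    return dist.cdf(log(n + 1)) - dist.cdf(log(n))

def classify_with_length(words):
    # Rescale P(cl|doc) from prob_classify by P(len|cl) / P(len).
    probdist = classifier.prob_classify(document_features(words))
    n = len(words)
    scores = {cl: probdist.prob(cl) * p_length(dist, n) / p_length(overall_dist, n)
              for cl, dist in class_dists.items()}
    return max(scores, key=scores.get)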


There are multinomial Naive Bayes algorithms that can handle range values, but they are not implemented in NLTK. For NLTK's NaiveBayesClassifier, you could try using a few different length thresholds as binary features (see the sketch below). I'd also suggest trying the Maxent classifier to see how it handles shorter texts.
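For the threshold idea, here is a minimal sketch of a feature extractor in the style of the NLTK book's chapter 6 example; the cutoffs 50, 100, and 200 are arbitrary placeholders you would tune on held-out data:

def document_features(words):
    # Word-presence features plus a few binary length-bucket features.
    features = {'contains(%s)' % w: True for w in words}
    for cutoff in (50, 100, 200):
        features['length<%d' % cutoff] = len(words) < cutoff
    return features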
