Publicly Available Spam Filter Training Set [closed]

2023-02-05 21:32

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.

Closed 7 years ago.

I'm new to machine learning, and for my first project I'd like to write a naive Bayes spam filter. I was wondering if there are any publicly available training sets of labeled spam/not spam emails, preferably in plain text and not a dump of a relational database (unless they pretty-print those?).

I know such a publicly available database exists for other kinds of text classification, specifically news article text. I just haven't been able to find the same sort of thing for emails.

Here is what I was looking for: http://untroubled.org/spam/

This archive holds around a gigabyte of compressed spam messages accumulated from 1998 to 2011. Now I just need non-spam email, so I'll pull that from my own Gmail using the getmail program and the tutorial at mattcutts.com
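Once labeled spam and non-spam text is in hand, the naive Bayes filter mentioned in the question can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the class name NaiveBayesFilter, the tokenizer, and the choice of add-one smoothing are all assumptions made for the example, and it expects at least one training message per label.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9$!']+", text.lower())


class NaiveBayesFilter:
    """Hypothetical multinomial naive Bayes with add-one (Laplace) smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Record one labeled message."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def log_score(self, text, label):
        """Return log P(label) + sum of log P(token | label)."""
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        # Log prior: fraction of training messages carrying this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for token in tokenize(text):
            # Add-one smoothing keeps unseen tokens from zeroing the product.
            score += math.log((counts[token] + 1) / (total + vocab))
        return score

    def classify(self, text):
        """Pick the label with the higher log score."""
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


# Tiny made-up training set, for illustration only.
spam_filter = NaiveBayesFilter()
spam_filter.train("win cash prize now", "spam")
spam_filter.train("cheap pills win money", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday with the team", "ham")
```

Working in log space avoids floating-point underflow when messages are long, which matters once the training data is a real archive rather than four toy sentences.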

Sure, there's Spambase, which, as far as I'm aware, is the most widely cited spam data set in the machine learning literature.

I have used this data set many times; each time I am impressed by how much effort has been put into its formatting and documentation.

A few characteristics of the Spambase set:

4601 data points, all complete
each has 58 attributes: 57 features plus the class label
each data point is labeled 'spam' or 'not spam'
approx. 40% are labeled spam
all of the features are continuous (vs. discrete)
a representative feature: average length of uninterrupted sequences of capital letters
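A data set with these characteristics can be read with a few lines of standard-library Python. The sketch below assumes the comma-separated layout of the UCI spambase.data file, where each row holds 57 numeric feature values followed by a 0/1 class label (58 columns in all); the function name load_spambase is made up for the example, and the two demo rows are synthetic, not real Spambase records.

```python
import csv
import io

N_FEATURES = 57  # assumed layout: 57 feature columns, then the class label


def load_spambase(lines):
    """Parse Spambase-style rows into (features, labels).

    `lines` is any iterable of text lines, such as an open file.
    Labels are ints, assumed to be 1 for spam and 0 for not spam.
    """
    features, labels = [], []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if len(row) != N_FEATURES + 1:
            raise ValueError(f"expected {N_FEATURES + 1} columns, got {len(row)}")
        features.append([float(value) for value in row[:N_FEATURES]])
        labels.append(int(row[N_FEATURES]))
    return features, labels


# Synthetic rows that only mimic the column layout.
demo = io.StringIO(
    ",".join(["0.5"] * N_FEATURES + ["1"]) + "\n"
    + ",".join(["0"] * N_FEATURES + ["0"]) + "\n"
)
X, y = load_spambase(demo)
spam_fraction = sum(y) / len(y)
```

With the real file downloaded, the same function could be called as load_spambase(open("spambase.data", newline="")), assuming that is the name the file was saved under.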

Spambase is archived in the UCI Machine Learning Repository; it's also available on the website for the excellent ML/statistical computation treatise, Elements of Statistical Learning by Hastie et al.

SpamAssassin has a public corpus of both spam and non-spam messages, although it hasn't been updated in a few years. Read the readme.html file to learn what's there.

You might consider taking a look at the TREC spam/ham corpus (which I think is the collection of emails from Enron that was made public from the court case). TREC generally runs a bunch of competitive text processing tasks, so it might give you some references for comparison.

The downside is that they're stored in raw mbox format, though there are parsers available in many languages (Apache Tika is a good example).
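If the messages are indeed in mbox format, Python's standard-library mailbox module is another parser option. The following is a minimal sketch under that assumption; the function name iter_plain_text is made up for the example, and it only collects text/plain parts, ignoring HTML bodies and attachments.

```python
import mailbox


def iter_plain_text(mbox_path):
    """Yield (subject, body) for each message in an mbox file.

    The body is the concatenation of the message's text/plain parts.
    """
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            subject = str(message.get("Subject", ""))
            parts = []
            for part in message.walk():
                if part.get_content_type() != "text/plain":
                    continue
                payload = part.get_payload(decode=True)
                if payload is None:
                    continue  # multipart containers carry no payload of their own
                charset = part.get_content_charset() or "utf-8"
                try:
                    parts.append(payload.decode(charset, errors="replace"))
                except LookupError:
                    # Unknown charset name, common in spam: fall back to latin-1.
                    parts.append(payload.decode("latin-1"))
            yield subject, "\n".join(parts)
    finally:
        box.close()
```

Each yielded body can then be passed to whatever tokenizer the classifier uses. Spam often carries malformed headers and bogus charsets, hence the lenient decoding.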

The webpage isn't TREC, but this seems to be a good overview of the task with links to the data: http://plg.uwaterloo.ca/~gvcormac/spam/

A more modern spam training set can be found at Kaggle. Moreover, you can test the accuracy of your classifier on their website by uploading your results.

I also have an answer: here you can find a daily refreshed Bayesian database for initial training, along with a daily archive of captured spam. Instructions for using it are on the site.

Tags: machine-learning spam-prevention training-data
