
How does the Gmail spam filter work?

I'm always surprised by the high quality of Gmail's spam filter. Over the last year, it filtered 99.95% of the spam and blocked only one legitimate message by mistake. By comparison, every other mail service I have used makes at least one mistake for every 50 messages.

How does Gmail reach this level of quality internally? Is it based on customer feedback (i.e., if N customers mark a message as spam, it is sorted as spam for every other customer)? Or is there some trick? Maybe a basic filter algorithm catches the most obvious spam, and the difficult cases are analyzed by real humans?


Briefly, it is based on community feedback. Here is a quote from the official explanation:

Gmail users play an important role in keeping spammy messages out of millions of inboxes. When the Gmail community votes with their clicks to report a particular email as spam, our system quickly learns to start blocking similar messages. The more spam the community marks, the smarter our system becomes.

You can read a bit more about it on their Spam Explained page.
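To make the "N customers report it" idea from the question concrete, here is a minimal sketch in Python. It is only an illustration, not Gmail's actual mechanism: the class name `CommunitySpamVotes`, the threshold of 3, and the exact-match fingerprint are all assumptions made for the example. The quote above says the real system learns to block *similar* messages, which a plain hash cannot do.

```python
import hashlib
from collections import defaultdict


class CommunitySpamVotes:
    """Toy vote counter: a message is spam once enough distinct users report it."""

    def __init__(self, threshold=3):
        # threshold is a hypothetical value for "N", chosen for illustration only.
        self.threshold = threshold
        # Maps a message fingerprint to the set of users who reported it.
        self.reporters = defaultdict(set)

    @staticmethod
    def fingerprint(body):
        # Normalize case and whitespace, then hash. This only catches exact
        # copies; a real system would need some notion of similarity.
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def report(self, user_id, body):
        # Using a set means repeated reports from one user count once.
        self.reporters[self.fingerprint(body)].add(user_id)

    def is_spam(self, body):
        reporters = self.reporters.get(self.fingerprint(body), ())
        return len(reporters) >= self.threshold
```

Usage would look like calling `report("alice", text)` for each user who clicks the spam button, then `is_spam(text)` when the same message arrives for someone else.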


This is the million-dollar question, and if it could be answered on Stack Overflow, then everyone's spam filter would be as effective.


I don't know exactly how Google does spam filtering (I think it's a trade secret, after all). If you are interested in how spam filtering works, I would recommend looking at Bayesian spam filtering (http://en.wikipedia.org/wiki/Bayesian_spam_filtering). It's a rather easy method to understand.
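For a feel of how Bayesian filtering works, here is a small naive Bayes sketch in Python. It is a teaching example under simplifying assumptions (whitespace tokenization, add-one smoothing, and at least one training message per class), and it says nothing about what Google actually runs. The names `NaiveBayesSpamFilter`, `train`, and `spam_probability` are made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic tokenizer: lowercase and split on whitespace.
    return text.lower().split()


class NaiveBayesSpamFilter:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # label must be "spam" or "ham".
        self.message_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        # Assumes train() was called at least once for each label.
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Start from the log prior P(label).
            score = math.log(self.message_counts[label] / total_messages)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                # Add-one (Laplace) smoothing so unseen words never give log(0).
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Turn the two log scores into P(spam | text).
        return 1.0 / (1.0 + math.exp(log_scores["ham"] - log_scores["spam"]))
```

After training on a handful of labeled messages, `spam_probability(text)` returns a value between 0 and 1, and you pick a cutoff (0.5, or something stricter if false positives are costly).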


Google is most likely using a classifier, such as logistic regression or a neural network. State-of-the-art spam detection frequently employs machine learning algorithms such as these.

The output classification is "Spam" or "Not Spam." The inputs are, I'm sure, top secret at Google, but I'm also sure that email text phrases such as "Buy Now," "On Sale," "Viagra," or "Male Enhancement" are among the factors in their model.
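As a rough illustration of the classifier idea, here is a tiny logistic regression in plain Python that uses the presence of those phrases as its only inputs. Everything here is an assumption for the sake of the example: the `KEYWORDS` feature list, the learning rate, and the epoch count are arbitrary, and a real model would use far more signals than four phrases.

```python
import math

# Hypothetical feature set: one binary feature per phrase from the answer above.
KEYWORDS = ["buy now", "on sale", "viagra", "male enhancement"]


def features(text):
    lowered = text.lower()
    return [1.0 if keyword in lowered else 0.0 for keyword in KEYWORDS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, text):
    # Returns the model's estimate of P(spam | text).
    x = features(text)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train(examples, epochs=500, lr=0.5):
    # examples is a list of (text, label) pairs with label 1 = spam, 0 = not spam.
    weights = [0.0] * len(KEYWORDS)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = features(text)
            # Gradient of the log loss with respect to the pre-sigmoid score.
            error = predict(weights, bias, text) - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias
```

Train it with something like `weights, bias = train(labeled_examples)` and then call `predict(weights, bias, new_text)`. The learned weights show how much each phrase pushes a message toward "Spam".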


There is no official release on this, and most of the suggestions are just observations or expert opinions.

Based on my observations of the emails we deliver, here are my findings:

1. User engagement is the key: If users are not engaging with your emails, then your emails are bound to be flagged as spam. Here are some metrics:
   - Whom you email, and how often you email them
   - Which emails you open
   - Which emails you reply to
   - Keywords that are in emails you usually read
   - Which emails you star, archive, or delete

2. Sender domain reputation: What is the history of the sending domain? If user engagement was high in the past, then the probability of a new email from the same domain landing in the inbox is high.

Google uses complex AI and machine learning algorithms to make this happen. You might get some success by changing the IP, domain, or return-path, but those are very short-term hacks.
