Bayesian email filtering is a machine-learning technique that classifies email as either spam or ham (legitimate mail). We, as humans, can quickly identify junk mail, but how do you train a computer to do it?

Along comes Bayes. Boom. Before reaching your spam box, the filter applies Bayes’ Theorem in the following way:

1. The filter scans your email for “tokens” (keywords, IP addresses, HTML tags, etc.) and stores them in a database.

2. The filter calculates the “spamicity” of each word in your email. The spamicity, which ranges from 0 to 1, is the probability that a message containing that word is spam. A neutral word like “that” would have a spamicity of 0.5 and therefore not affect the filter’s decision.

3. The spamicity is calculated based on how frequently a “token” appears in spam emails versus ham emails. See the example below.

4. The filter makes a decision.
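The steps above can be sketched in Python. This is a minimal illustration, not any particular filter’s implementation: the word-only tokenizer and the equal-priors spamicity estimate are simplifying assumptions.

```python
import re

def tokenize(message):
    """Split a message into lowercase word tokens. A real filter would
    also extract IP addresses, HTML tags, and other token types."""
    return re.findall(r"[a-z0-9']+", message.lower())

def spamicity(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token) from how often the token appears in
    spam versus ham messages, assuming equal priors P(spam) = P(ham)."""
    p_token_spam = spam_counts.get(token, 0) / n_spam
    p_token_ham = ham_counts.get(token, 0) / n_ham
    if p_token_spam + p_token_ham == 0:
        return 0.5  # token never seen before: treat as neutral
    return p_token_spam / (p_token_spam + p_token_ham)
```

With the article’s later example (8 of 10 spam and 1 of 10 ham messages containing “viagra”), `spamicity("viagra", {"viagra": 8}, {"viagra": 1}, 10, 10)` comes out to about 0.89.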

Bayesian filtering is adaptive, and to be effective it needs to be trained. If an email erroneously lands in your spam folder and you click “not spam”, the filter learns from the correction. We cannot apply one universal algorithm to every email user, because user preferences and patterns differ (1).
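One way to sketch that feedback loop: when the user clicks “not spam”, re-label the message’s tokens from spam to ham so that future spamicity estimates shift. The function name and the token-count data layout here are illustrative assumptions, not any specific filter’s API.

```python
def mark_not_spam(counts, totals, tokens):
    """Re-label a misclassified message as ham: move its message count
    and its tokens' counts from the spam column to the ham column."""
    totals["spam"] -= 1
    totals["ham"] += 1
    for token in set(tokens):
        entry = counts.setdefault(token, {"spam": 0, "ham": 0})
        if entry["spam"] > 0:
            entry["spam"] -= 1
        entry["ham"] += 1

# Example: the user rescues a message containing "viagra" from spam.
counts = {"viagra": {"spam": 8, "ham": 1}}
totals = {"spam": 10, "ham": 10}
mark_not_spam(counts, totals, ["viagra"])
```

After the correction, “viagra” counts as 7 spam / 2 ham, so its spamicity drops slightly for this user.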

Here is an example of how it works:

Before we collect any data we make a prior judgment. For example, the probability that a certain message is spam is 0.5.

P(spam)=0.5

P(ham) = 0.5

Now we collect data.

Say we receive 20 emails.

10 emails are spam emails and 8 of them contain the word “viagra.”

10 emails are ham emails. 1 contains the word “viagra.”

And that is our very basic database.
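That database can be represented as per-token counts for each class. Here is one way to build it from the 20-message example; the `(label, tokens)` layout is an assumption for illustration.

```python
from collections import defaultdict

def build_database(labelled_messages):
    """Count, for each token, how many spam and how many ham messages
    contain it. labelled_messages: iterable of (label, token_set),
    where label is 'spam' or 'ham'."""
    counts = defaultdict(lambda: {"spam": 0, "ham": 0})
    totals = {"spam": 0, "ham": 0}
    for label, tokens in labelled_messages:
        totals[label] += 1
        for token in set(tokens):
            counts[token][label] += 1
    return counts, totals

# The article's example: 8 of 10 spam and 1 of 10 ham contain "viagra".
messages = ([("spam", {"viagra"})] * 8 + [("spam", set())] * 2
            + [("ham", {"viagra"})] * 1 + [("ham", set())] * 9)
counts, totals = build_database(messages)
```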

Now we test the filter. Another email containing the word “viagra” arrives. Do we put this email into the junk mail folder? We need to calculate the probability that this email is spam given that it contains the word “viagra.”

Bayes’ Theorem:

P(spam|“viagra”) = P(“viagra”|spam) · P(spam) / [P(“viagra”|spam) · P(spam) + P(“viagra”|ham) · P(ham)]

P(spam|”viagra”) is the probability that a message is spam, given the word “viagra” is in it. This is what we are trying to calculate.

P(“viagra”|spam) is the probability that the word “viagra” is in a spam message, which is 8 out of 10.

P(spam) is the probability that any message is spam. This is 0.5 as per our prior.

P(“viagra”|ham) is the probability that the word “viagra” is in a ham message. 1 out of 10 emails.

P(ham) is the probability that any message is ham. Again, 0.5.

Plugging in: P(spam|“viagra”) = (0.8 × 0.5) / (0.8 × 0.5 + 0.1 × 0.5) = 0.4 / 0.45 ≈ 0.8889. Therefore, there is an 88.89% chance that our email is spam.
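The same arithmetic, spelled out with the example’s numbers:

```python
# Priors: before seeing any words, a message is equally likely spam or ham.
p_spam, p_ham = 0.5, 0.5
p_word_given_spam = 8 / 10   # 8 of 10 spam messages contain "viagra"
p_word_given_ham = 1 / 10    # 1 of 10 ham messages contains "viagra"

# Bayes' theorem: P(spam | "viagra")
p_spam_given_word = (p_word_given_spam * p_spam) / (
    p_word_given_spam * p_spam + p_word_given_ham * p_ham)
# ≈ 0.8889, i.e. an 88.89% chance the message is spam
```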

Combining Probabilities

Most filters use a Naive Bayes classifier, which assumes conditional independence rather than conditional dependence between tokens. For example, even though “free” and “Nigerian” both contribute to the probability of spam, the presence of the token “free” is assumed not to affect the presence of the token “Nigerian.” Each token’s probability is calculated separately, and the probabilities are then combined into an overall probability of spam.

Here’s how it works for more tokens:

The filter scans for high-spamicity tokens, e.g. free, Nigerian, enlargement. We want to calculate the probability that a message is spam given that it contains tokens from the Spam Database.

P(spam|SD) = P(spam) · ∏ P(tᵢ|spam) / [P(spam) · ∏ P(tᵢ|spam) + P(ham) · ∏ P(tᵢ|ham)], where SD = {t₁, …, tₙ} represents the message’s tokens found in the Spam Database

The filter also scans for low-spamicity tokens, e.g. coffee, thoughts, dinner. We compute the probability that a message is not spam given that it contains tokens from the Ham Database.

P(not spam|HD) = P(ham) · ∏ P(tᵢ|ham) / [P(ham) · ∏ P(tᵢ|ham) + P(spam) · ∏ P(tᵢ|spam)], where HD = {t₁, …, tₙ} represents the message’s tokens found in the Ham Database.

If P(spam|SD) > P(not spam|HD), then the server will put the email into the spam folder (2).
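The naive-independence combination can be sketched as follows, multiplying per-token likelihoods in log space to avoid underflow on long messages. The likelihood values here are hypothetical, chosen only to illustrate the calculation; they are not from the article’s data.

```python
import math

def p_spam_given_tokens(tokens, likelihoods, p_spam=0.5):
    """Naive Bayes: combine per-token likelihoods under the assumption
    that tokens are conditionally independent given the class.
    likelihoods maps token -> (P(token|spam), P(token|ham))."""
    log_spam = math.log(p_spam)
    log_ham = math.log(1 - p_spam)
    for token in tokens:
        p_t_spam, p_t_ham = likelihoods[token]
        log_spam += math.log(p_t_spam)
        log_ham += math.log(p_t_ham)
    # Normalise the two joint scores back into a probability.
    return 1 / (1 + math.exp(log_ham - log_spam))

# Hypothetical per-token likelihoods for illustration only.
likelihoods = {
    "free": (0.60, 0.20),
    "nigerian": (0.30, 0.01),
    "dinner": (0.05, 0.30),
}
p = p_spam_given_tokens(["free", "nigerian"], likelihoods)
```

Two high-spamicity tokens together push the combined probability close to 1, while a low-spamicity token like “dinner” alone pulls it well below 0.5.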

References

1. Process Software. http://www.process.com/precisemail/bayesian_filtering.htm

2. Nguyen, Khuong An. Spam Filtering with Naive Bayesian Classification. University of Cambridge. 09-April-11. http://khuong.vn/Papers/SpamFilter.pdf