Building a naive Bayes classifier for spam filtering, in Lua


Classifying e-mail messages as spam or not spam (ham) is a classic application of Bayes classifiers. This post presents a simple implementation of a naive Bayes classifier using Lua and lna.

Training and test data

The data set used for training is a 700 x 2500 matrix, where each row represents a single message and each column a feature. The features were extracted using the bag-of-words model, so each column represents one word and the value of cell (i,j) is the number of occurrences of word j in message i. Half of the data set corresponds to ham messages, while the other half are examples of spam.

In other words, the messages were preprocessed (stop-word removal, lemmatisation) so that the total number of distinct words is 2500, with each column representing one word. Information about word ordering is not kept in this scheme.
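To make the feature extraction concrete, here is a small sketch in plain Lua (not part of lna; the function and variable names are hypothetical) of how one message could be turned into a row of the feature matrix, given a vocabulary that maps each of the 2500 words to a column index:

-- build one bag-of-words row from an already tokenised, stop-word-free message
local function bag_of_words(tokens, vocabulary, n_features)
  local row = {}
  for j = 1, n_features do row[j] = 0 end
  for _, word in ipairs(tokens) do
    local j = vocabulary[word]
    if j then row[j] = row[j] + 1 end  -- words outside the vocabulary are ignored
  end
  return row
end

-- e.g. bag_of_words({"free", "money", "free"}, {free = 1, money = 2}, 2500)
-- gives a row with row[1] = 2, row[2] = 1 and every other entry 0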

For testing purposes, another data set with 260 messages was built using the same methodology as in the training data set.

The data used in this example is based on the data from this course, which is in turn based on the Ling-Spam data set.

Methodology

The implementation follows the usual algorithm for a naive Bayes classifier. The theory is left for the reader to study elsewhere, as I am lazy and there are plenty of tutorials / courses involving Bayes classifiers online.

There are two details worth mentioning:

  1. This implementation takes the logarithm of the computed probability matrices, turning multiplication into summation, which improves numerical precision (and also simplifies the process, since missing words are handled automatically); a small sketch of this scoring step follows this list.

  2. For this data set, the classifier makes fewer mistakes when multiple occurrences of the same word in a message are ignored, i.e., when the features are limited to {0,1} (word present or not). Doing this reduces the number of errors from 5 to 4, which is not enough evidence to decide whether clipping the values to {0,1} is good or bad in general.
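To illustrate the first point, here is a minimal sketch in plain Lua (independent of lna, with hypothetical names) of the scoring step: with log-probabilities the product over words becomes a sum, and any word with a zero count in the message simply contributes nothing.

-- score a {0,1} feature vector x against the vector of log P(word | class)
local function score(x, log_prob)
  local s = 0
  for j = 1, #x do
    s = s + x[j] * log_prob[j]
  end
  return s
end

-- classify as spam when the spam score is larger
local function classify(x, log_prob_spam, log_prob_ham)
  return score(x, log_prob_spam) > score(x, log_prob_ham)
end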

Implementation

require('lna.lab')

-- ** training **

-- load the feature (X) and label (y) matrices
local X = matrix.load("train-features", " ")
local y = matrix.load("train-labels"  , " ")

-- separate in the 'ham' and 'spam' groups
local ham  = X:filter(function(_,i) return y(i) == 0 end):clip(0,1,true)
local spam = X:filter(function(_,i) return y(i) == 1 end):clip(0,1,true)

-- compute the values of P(Fi | C), for every feature Fi and class C
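-- add-one (Laplace) smoothing: the numerator is the per-word count within the
-- class plus 1, and the denominator is the vocabulary size (dict_wc, the number
-- of columns) plus the total word count of the class (ham_wc / spam_wc)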
local dict_wc, ham_wc, spam_wc  = ham.n, ham:sum(), spam:sum()
local prob_ham  = (ham:summ() + 1):div(dict_wc + ham_wc ):t()
local prob_spam = (spam:summ()+ 1):div(dict_wc + spam_wc):t()

-- ** test **

-- load the test feature and label matrices
local Xt = matrix.load("test-features", " "):clip(0,1,true)
local yt = matrix.load("test-labels"  , " "):clip(0,1,true)

-- Use log() to avoid precision errors and to turn the product of probabilities
-- into a sum; words that do not appear in a message then contribute nothing
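-- Note: the class prior P(C) is left out; the training set is split evenly
-- between ham and spam, so both priors are 0.5 and cancel out in the comparison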
local res_ham  = Xt * prob_ham:log()
local res_spam = Xt * prob_spam:log()
local is_spam  = res_spam:gt(res_ham)

local errors = is_spam:xor(yt):sum()
print(string.format('Errors: %d (%.2f%%)', errors, (errors / Xt.m) * 100))

Results

With the test data set, 4 messages (out of 260, or 1.54%) are misclassified. Testing on the training data itself results in 5 errors (out of 700, or 0.71%).

This simple implementation runs in about 1 s on my 7-year-old laptop, and about 85% of that time is spent reading the feature matrices (in [3] the author(s) used a sparse matrix, greatly reducing the load time).

The prepared training and test data, along with the sample implementation, can be downloaded here (original source: Ling-Spam data set).