Classifying e-mail messages as spam or not spam (ham) is a classic application of Bayes classifiers. This post presents a simple implementation of a naive Bayes classifier in Lua, using lna.

## Training and test data

The training data set is a 700 x 2500 matrix, where each row represents a single message and each column is a feature. The features were extracted with the bag-of-words model, so each column represents one word and the value of cell (i,j) is the number of occurrences of word j in message i. Half of the data set consists of ham messages, while the other half are examples of spam.
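
To make the layout of the matrix concrete, here is a minimal sketch of how a bag-of-words count matrix could be built. It is written in Python rather than Lua, it is not the code used to prepare the data for this post, and the function name `bag_of_words`, the toy vocabulary and the toy messages are all made up for illustration.

```python
from collections import Counter

def bag_of_words(messages, vocabulary):
    """Return a count matrix: one row per message, one column per vocabulary word."""
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for tokens in messages:
        row = [0] * len(vocabulary)
        for word, n in Counter(tokens).items():
            j = index.get(word)
            if j is not None:  # words outside the vocabulary are dropped
                row[j] = n
        matrix.append(row)
    return matrix

# Toy data (hypothetical, not taken from the Ling-Spam set)
vocabulary = ["buy", "cheap", "meeting", "now"]
messages = [
    ["buy", "cheap", "buy", "now"],   # already tokenised and preprocessed
    ["meeting", "now", "agenda"],     # "agenda" is not in the vocabulary
]
X = bag_of_words(messages, vocabulary)
```

Here `X[i][j]` plays the role of cell (i,j): the first toy message contains "buy" twice, so `X[0][0]` is 2. The real matrix has the same shape, just with 700 rows and 2500 columns.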

In other words, the messages were preprocessed (stop-word removal, lemmatisation) so that the total number of distinct words is 2500, with one column per word. Word order is not taken into account in this scheme.

For testing purposes, another data set with 260 messages was built using the same methodology as in the training data set.

The data used in this example is based on the data from this course, originally based on the Ling-Spam data set.

## Methodology

The implementation follows the usual algorithm for a naive Bayes classifier. The theory is left for the reader to study elsewhere, as I am lazy and there are plenty of tutorials / courses involving Bayes classifiers online.

There are two details worth mentioning:

1. This implementation takes the logarithm of the computed probability matrices, turning multiplication into summation. This improves numerical stability, since a product of many small probabilities can underflow, and it simplifies the process too, as it now handles missing words automatically.

2. For this data set, the classifier makes fewer mistakes when I ignore multiple occurrences of the same word in a message, i.e., when the features are limited to {0,1} (word is present or not). Doing this reduces the number of errors from 5 to 4, but that is not enough evidence to decide whether clipping the values to the {0,1} set is a good idea in general.
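
The two details above can be sketched as follows. This is a Python illustration, not the Lua/lna code from this post. It assumes a multinomial model with Laplace (add-one) smoothing, which the post does not specify, and the names `train`, `classify`, `binary` and `alpha` are made up here. With `binary=True` the counts are clipped to {0,1} before being used, which is one way to read the second item.

```python
import math

def train(X, y, binary=False, alpha=1.0):
    """Fit per-class log priors and per-word log likelihoods.

    X: list of count rows, y: list of labels (0 = ham, 1 = spam).
    alpha is the Laplace smoothing constant (an assumption of this sketch).
    """
    n_words = len(X[0])
    log_prior, log_like = {}, {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        totals = [0.0] * n_words
        for row in rows:
            for j, v in enumerate(row):
                totals[j] += min(v, 1) if binary else v
        denom = sum(totals) + alpha * n_words
        log_like[c] = [math.log((t + alpha) / denom) for t in totals]
    return log_prior, log_like

def classify(row, log_prior, log_like, binary=False):
    """Return the class with the highest summed log score."""
    best, best_score = None, -math.inf
    for c in log_prior:
        score = log_prior[c]
        for j, v in enumerate(row):
            if v:  # words absent from the message add nothing to the sum
                score += (1 if binary else v) * log_like[c][j]
        if score > best_score:
            best, best_score = c, score
    return best

# Toy data (hypothetical): two spam rows, two ham rows, four words
X_toy = [[2, 1, 0, 1], [3, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
y_toy = [1, 1, 0, 0]
model = train(X_toy, y_toy)
model_bin = train(X_toy, y_toy, binary=True)
spam_guess = classify([1, 1, 0, 0], *model)
ham_guess = classify([0, 0, 1, 1], *model_bin, binary=True)
```

Because the score is a sum of `count * log_likelihood` terms, a word with a count of zero simply drops out, so no special case is needed for words that do not appear in a message.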

## Results

With the test data set, 4 messages (out of 260, or 1.54%) are misclassified. Testing on the training data itself yields 5 errors (out of 700, or 0.71%).
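
For reference, the percentages are just the error count divided by the number of messages. A small Python sketch of that bookkeeping, with a made-up helper name `error_rate` and toy label lists:

```python
def error_rate(predicted, expected):
    """Return (number of errors, fraction of messages misclassified)."""
    errors = sum(1 for p, e in zip(predicted, expected) if p != e)
    return errors, errors / len(expected)

# Toy labels (hypothetical): one mistake in four messages
toy_errors, toy_rate = error_rate([1, 0, 1, 1], [1, 0, 0, 1])

# The figures quoted in this post
test_pct = round(100 * 4 / 260, 2)    # test set
train_pct = round(100 * 5 / 700, 2)   # training set
```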

This simple implementation runs in about 1s on my 7-year-old laptop, and about 85% of the time is spent reading the feature matrices (in [3], the author(s) used a sparse matrix, which greatly improves the load time).
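
To give an idea of why a sparse representation loads faster, here is a small Python sketch that reads only the non-zero cells. The file format is an assumption made for this example (one `message word count` triplet per line, zero-based indices) and is not necessarily the format used by the files in this post; `load_sparse` is a made-up name.

```python
def load_sparse(lines):
    """Parse 'message word count' triplets into {message: {word: count}}."""
    rows = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        i, j, n = (int(p) for p in parts)
        rows.setdefault(i, {})[j] = n
    return rows

# Toy input (hypothetical): two messages, four non-zero cells
sample = ["0 0 2", "0 3 1", "1 2 1", "1 3 1"]
sparse = load_sparse(sample)
```

Most messages contain only a small fraction of the 2500 words, so storing just the non-zero cells means far fewer numbers to parse than a dense 700 x 2500 matrix.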

The prepared training and test data, along with the sample implementation, can be downloaded here (original source: Ling-Spam data set).