In my previous posts (Part 1 | Part 2 | Part 3), I described the k-nearest neighbors algorithm, applied a benchmark model to the classification of hand-written digits, then chose an optimal value for k as the one that minimized the model’s prediction error on a dedicated validation data set. I also excluded about 2/3 of the features (image pixels) from the model because they had near-zero variance, thereby improving both performance and runtime. Now, I’d like to add one last complication to the kNN model: weighting.
Recall that an unlabeled observation is assigned a class by a majority vote of its k nearest neighbors; the most common class among the neighbors wins. In the simplest case, each neighbor’s vote counts exactly the same. How democratic! However, what if you wanted to give more weight to the votes of neighbors that are closer to the unlabeled observation than those that are farther away? [Insert commentary on political corruption, lobbyists.] As it turns out, this is easily accomplished by means of a distance metric in a weighted kNN model, where neighbors in the training set are given more or less weight according to their distance from the unlabeled observation.
Up til now, I’ve used the simple Euclidean distance, which is essentially the straight line segment connecting two points, say, $p$ and $q$. Symbolically:
The 2-dimensional case should look familiar (think: Pythagorean theorem). In the parlance of non-parametric statistics, Euclidean distance corresponds to a uniform (or rectangular) kernel. [Technical note: kNN is a non-parametric learning algorithm, meaning that it makes no assumptions about the underlying distribution of the data. Contrast this with, say, a generalized linear model.] However, this isn’t the only option! We can use a triangular kernel, where the weight assigned to a neighbor decreases linearly with the distance from the test observation; Gaussian kernel, where the weighting function follows a Gaussian distribution; or many others. Here’s a nice plot from Wikipedia that overlays a variety of kernels:
Okay! Now I would like to optimize a kNN model for both the number of neighbors and the kernel weighting function. To do this, I use a different package in R —– kknn –— since the one I used last time —– the FNN package —– has a rectangular kernel only. The kknn package implements leave-one-out cross-validation, in which a model is trained on all but one example in the training set, tested on that one validation example, then repeated such that each example acts as the validation example once. Here’s the code:
# weighted k-nearest neighbors package library(kknn) # load the training data set train <- read.csv("train.csv", header=TRUE) # remove near-zero variance features library(caret) badCols <- nearZeroVar(train[, -1]) train <- train[, -(badCols+1)] # optimize knn for k=1:15 # and kernel=triangular, rectangular, or gaussian model <- train.kknn(as.factor(label) ~ ., train, kmax=15, kernel=c("triangular","rectangular","gaussian")) # print out best parameters and prediction error print(paste("Best parameters:", "kernel =", model$best.parameters$kernel, ", k =", model$best.parameters$k)) print(model$MISCLASS)
And here’s the resulting plot:
Models corresponding to the solid lines use all features in the training data, while those corresponding to the dashed lines use a reduced set without the near-zero variance features. The best models of both sets are marked by a black symbol. What do we see? Models trained on a reduced feature set perform significantly better than those trained on all features. (They also run faster!) Models with a triangular kernel perform better than those with rectangular or gaussian kernels. The best kNN model uses a triangular kernel, excludes near-zero variance features, and has k = 9, which we implement and run over the test data set like so:
# load test datasets test <- read.csv("test.csv", header=TRUE) # train the optimal kknn model model <- kknn(as.factor(label) ~ ., train, test, k=9, kernel="triangular") results <- model$fitted.values # save the class predictions in a column vector write(as.numeric(levels(results))[results], file="kknn_submission.csv", ncolumns=1)
As before, Kaggle uses prediction error to assess the model’s performance — so how does this one do? Much better!
Clearly, optimizing your statistical model pays off!
Although k-nearest neighbors is a straightforward and surprisingly powerful machine learning technique, other options out there easily outperform it. I’m going to take a break on the hand-written digits problem for now, but maybe I’ll come back to it wielding more powerful statistical tools… :)