A linear classifier is probably the simplest machine learning technique. Given some input vector , predict some output . One trains a vector of “weights” , that determine the behavior of the classifier. Given some new input , the prediction will be:

Here, is the indicator function– if is true, and -1 otherwise. So, is one of two classes.

Often the classifier will be written like , i.e. including a “bias” term . Here I’ll ignore this– if you want a bias term, just append a to every vector .

Now, it seems like nothing could be simpler than a linear classifier, but there is considerable subtlety in how to *train* a linear classifier. Given some training data , how do we find the vector of weights ?

There are various popular ways of doing this. Three popular techniques are:

- The Perceptron Algorithm
- Logistic Regression
- Support Vector Machines (Hinge loss)

The perceptron algorithm works, roughly speaking as follows: Go through the training data, element by element. If one datum is misclassified, slightly nudge the weights so that it is closer to being correct. This algorithm has the remarkable property that, *if the training data is linearly separable*, it will find a that correctly classifies it in a finite number of iterations. (The number of iterations that it takes depends on the “how separable” the data is, making fewer mistakes on data with a larger margin.)

Unfortunately, the assumption that the training data is linearly separable is basically never true in practice. There is still a bound in terms of how much points would need to be moved in order to be linearly separable, but it isn’t as strong as we might like. (I don’t know what the folklore is for how this tends to work in practice.)

Logistic regression and support vector machines take a different approach. There, a “loss function” is defined in terms of the weights, and one just optimizes the loss to find the best weights. (I’m ignoring regularization here for simplicity.) For logistic regression, the loss is

where , for a sigmoid function , and .

For a support vector machine, the loss is

where is if and 0 otherwise. This is a hinge loss. Notice that it will be zero if , or if that particular training element is comfortably on the correct side of the decision boundary. Otherwise, the “pain” is proportional to “how wrong” the classification is.

Both the hinge-loss and logistic regression are convex loss functions. This means that if we apply a nonlinear search procedure to find a local minima, that will also be the global minima.

These different loss functions are compared here.

**WHY NOT DO THE OBVIOUS THING?**

Now, the critical point about these three above methods is that none of them do the seemingly obvious thing: **find the vector of weights that has the lowest classification error on the training data**. Why not? Some would defend logistic regression or SVM for theoretical reasons (namely a meaningful probabilistic interpretation, and the reasonableness and theoretical guarantees of max-margin methods, respectively).

However, probably the more significant hurdle is computational considerations. Namely, the problem of finding the weights with lowest classification error is (in order of increasing horribleness) non-convex, non-differentiable, and NP-hard.

In fact it is NP-hard even to find an approximate solution, with worst-case guarantees. (See citation 2 below, which, interestingly, gives an approximate algorithm in terms of a property of the data.)

Nevertheless, if classification error is what we want, I don’t see how it makes sense to minimize some alternative loss function. As such, I decided today to try the following algorithm.

- Apply an SVM to get an initial solution.
- Apply heuristic search to minimize classification error, initialized to the solution of step 1.

Now, I am *well aware* of the problems with non-convex optimization. However, simply the fact that logistic regression or hinge loss is convex is not an argument in their favor. If theoretical considerations dictate that we minimize classification error, just substituting a different loss, and then refusing to look at the classification rate is highly questionable. Sure, that can lead to a convex optimization problem, but at the that’s because a different problem is being solved! The goal is accurate prediction of future data, not accurate minimization of a loss on the training data. If we use our best convex approximation to initialize the heuristic optimization of the loss we really want, we will never do worse.

**EXPERIMENT**

There are some heuristics available for doing this (Reference 3), but they seem a little expensive. I decided to try something really simple.

- Fit a classifier (by hinge loss or logistic regression).
- Throw out a few of the most badly misclassified points.
- Repeat until all remaining points are correctly classified.

The idea is that, by “giving up” on badly misclassified points, the boundary might be movable into a position where it correctly classifies other points. There is *no guarantee* in general that this will find a linear classifier with a lower misclassification rate, but it should not do worse.

To test this out, I got some data from the MNIST handwritten digit database. To make the problem harder, I took one class to be random images from either the set of 1’s or 2’s, while the other class was 3’s or 4’s. The images were downsampled to 7×7, and a constant of one added. So we are looking for a linear classifier in a 50 dimensional space.

The data was trained on a database of 10,000 examples, with a test set of 10,000 examples.

Here are the results where we throw out the worst 100 training points in step 2:

Here are the results when we throw out 20 at a time:

Here are the results when we throw out 5 at a time:

There seems to be a trade-off in step 2: Fewer models need to be fit if we throw out more points at a time. However, this seems to come with a small penalty in terms of terms of the classification error on the final model.

So, in summary– a drop in classification error on test data from .941 to .078. Thats a 17% drop. (Or a 21% drop, depending upon which rate you use as a base.) This from a method that you can implement in basically zero extra work if you already have a linear classifier. Seems worth a try.

References:

[1] The Apparent Tradeoff between Computational Complexity and Generalization of Learning: A Biased Survey of our Current Knowledge by Shai Ben-David

[2] Efficient Learning of Linear Perceptrons by Shai Ben-david, Hans Ulrich Simo

[3] Optimizing 0/1 Loss for Perceptrons by Random Coordinate Descent by L. Li and H.-T. Lin