## Linear Classifiers and Loss Functions

A linear classifier is probably the simplest machine learning technique. Given some input vector $\bf x$, it predicts some output $y$. One trains a vector of “weights” $\bf w$ that determines the behavior of the classifier. Given some new input $\bf x$, the prediction $f$ will be:

$f({\bf x}) = I[{\bf w}^T {\bf x}>0]$

Here, $I[\text{expr}]$ is the indicator function: $1$ if $\text{expr}$ is true, and $-1$ otherwise. So, $y$ is one of two classes.

Often the classifier will be written like $I[{\bf w}^T {\bf x}+b>0]$, i.e. including a “bias” term $b$. Here I’ll ignore this: if you want a bias term, just append a $1$ to every vector $\bf x$.
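As a minimal sketch of the prediction rule and the bias trick, here is a plain-Python version. The function names `add_bias` and `predict` and the example weights are made up for illustration.

```python
def add_bias(x):
    """Append a constant 1 so that the last weight acts as the bias b."""
    return list(x) + [1.0]

def predict(w, x):
    """Return +1 if w^T x > 0, and -1 otherwise."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > 0 else -1

# Hypothetical weights: the last entry plays the role of b.
w = [2.0, -1.0, 0.5]
print(predict(w, add_bias([1.0, 1.0])))  # score = 2 - 1 + 0.5 = 1.5
print(predict(w, add_bias([0.0, 1.0])))  # score = -1 + 0.5 = -0.5
```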

Now, it seems like nothing could be simpler than a linear classifier, but there is considerable subtlety in how to train a linear classifier. Given some training data $\{({\hat {\bf x}},{\hat y})\}$, how do we find the vector of weights $\bf w$?

There are various popular ways of doing this. Three popular techniques are:

• The Perceptron Algorithm
• Logistic Regression
• Support Vector Machines (Hinge loss)

The perceptron algorithm works, roughly speaking, as follows: Go through the training data, element by element. If one datum is misclassified, slightly nudge the weights so that it is closer to being correct. This algorithm has the remarkable property that, if the training data is linearly separable, it will find a $\bf w$ that correctly classifies it in a finite number of iterations. (The number of iterations that it takes depends on “how separable” the data is: the algorithm makes fewer mistakes on data with a larger margin.)
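The nudging step can be sketched roughly as below. This is one common form of the update (add `step * y * x` to the weights on a mistake). The function name, the `max_epochs` cap and the toy data are assumptions for illustration, with labels in $\{-1, +1\}$ and a constant 1 appended to each input.

```python
def perceptron(data, n_features, max_epochs=100, step=1.0):
    """Cycle through the data, nudging w toward each misclassified point."""
    w = [0.0] * n_features
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # wrong side of (or exactly on) the boundary
                w = [wi + step * y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # a full pass with no mistakes: data is separated
            break
    return w

# Toy, linearly separable data: (features with a trailing 1, label).
data = [
    ([2.0, 1.0, 1.0], 1),
    ([1.0, 3.0, 1.0], 1),
    ([-1.0, -2.0, 1.0], -1),
    ([-2.0, -0.5, 1.0], -1),
]
w = perceptron(data, 3)
print(w)
```

The `max_epochs` cap matters only when the data is not separable, in which case the loop would otherwise never stop.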

Unfortunately, the assumption that the training data is linearly separable is basically never true in practice. There is still a bound in terms of how much points would need to be moved in order to be linearly separable, but it isn’t as strong as we might like. (I don’t know what the folklore is for how this tends to work in practice.)

Logistic regression and support vector machines take a different approach. There, a “loss function” is defined in terms of the weights, and one just optimizes the loss to find the best weights. (I’m ignoring regularization here for simplicity.) For logistic regression, the loss is

$L = -\sum_{({\hat {\bf x}},{\hat y})} \log p({\hat y}|{\hat {\bf x}})$

where $p(y=1|{\bf x})=\sigma({\bf w}^T {\bf x})$, for a sigmoid function $\sigma$, and $p(y=-1|{\bf x})=1-p(y=1|{\bf x})$.
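To make the loss concrete, here is a small sketch that assumes $\sigma$ is the logistic sigmoid $\sigma(a) = 1/(1+e^{-a})$. In that case $1-\sigma(a)=\sigma(-a)$, so with labels in $\{-1,+1\}$ each term $-\log p({\hat y}|{\hat {\bf x}})$ equals $\log(1+\exp(-{\hat y}\,{\bf w}^T{\hat {\bf x}}))$. The function names are illustrative.

```python
import math

def log1p_exp(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def logistic_loss(w, data):
    """Sum of -log p(y | x) over (x, y) pairs, with y in {-1, +1}."""
    total = 0.0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x))
        total += log1p_exp(-y * score)
    return total

# With w = 0, every point has p = 1/2, so each term is log 2.
print(logistic_loss([0.0, 0.0], [([1.0, 2.0], 1), ([3.0, -1.0], -1)]))
```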

For a support vector machine, the loss is

$L = \sum_{({\hat {\bf x}},{\hat y})} (1-{\hat y} \cdot {\bf w}^T {\hat {\bf x}})_+$

where $(a)_+$ is $a$ if $a>0$ and 0 otherwise. This is a hinge loss. Notice that it will be zero if ${\hat y} \cdot {\bf w}^T {\hat {\bf x}} \geq 1$, that is, if that particular training element is comfortably on the correct side of the decision boundary. Otherwise, the “pain” is proportional to “how wrong” the classification is.
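A direct transcription of the hinge loss, as a sketch with an illustrative function name and made-up 1-D data:

```python
def hinge_loss(w, data):
    """Sum of (1 - y * w^T x)_+ over (x, y) pairs, with y in {-1, +1}."""
    total = 0.0
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        total += max(0.0, 1.0 - margin)
    return total

w = [1.0]
data = [
    ([3.0], 1),   # margin 3: comfortably correct, contributes 0
    ([0.5], 1),   # margin 0.5: correct but too close, contributes 0.5
    ([2.0], -1),  # margin -2: misclassified, contributes 3
]
print(hinge_loss(w, data))  # 0 + 0.5 + 3 = 3.5
```

Note that the second point is correctly classified but still pays a penalty, because it sits inside the margin.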

Both the hinge loss and the logistic regression loss are convex. This means that if we apply a nonlinear search procedure to find a local minimum, that will also be a global minimum.

These different loss functions are compared here.

WHY NOT DO THE OBVIOUS THING?

Now, the critical point about the three methods above is that none of them does the seemingly obvious thing: find the vector of weights ${\bf w}$ that has the lowest classification error on the training data. Why not? Some would defend logistic regression or SVM for theoretical reasons (namely a meaningful probabilistic interpretation, and the reasonableness and theoretical guarantees of max-margin methods, respectively).

However, probably the more significant hurdle is computational considerations. Namely, the problem of finding the weights with lowest classification error is (in order of increasing horribleness) non-convex, non-differentiable, and NP-hard.

In fact it is NP-hard even to find an approximate solution, with worst-case guarantees. (See citation 2 below, which, interestingly, gives an approximate algorithm in terms of a property of the data.)
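To see the non-differentiability concretely, here is a small sketch of the training classification error as a function of the weights. The function name and the toy 1-D data (with a trailing 1 for the bias) are made up for illustration. The error stays flat as the bias changes, then jumps when the boundary crosses a data point, so the gradient is zero almost everywhere and says nothing about which way to move.

```python
def classification_error(w, data):
    """Fraction of (x, y) pairs for which the sign prediction differs from y."""
    wrong = 0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x))
        pred = 1 if score > 0 else -1
        if pred != y:
            wrong += 1
    return wrong / len(data)

# Toy 1-D data: two negatives on the left, two positives on the right.
data = [([-2.0, 1.0], -1), ([-1.0, 1.0], -1), ([1.0, 1.0], 1), ([2.0, 1.0], 1)]

for b in [0.0, 0.5, 0.9, 1.5]:
    print(b, classification_error([1.0, b], data))
```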

Nevertheless, if classification error is what we want, I don’t see how it makes sense to minimize some alternative loss function. As such, I decided today to try the following algorithm.

1. Apply an SVM to get an initial solution.
2. Apply heuristic search to minimize classification error, initialized to the solution of step 1.

Now, I am well aware of the problems with non-convex optimization. However, simply the fact that logistic regression or hinge loss is convex is not an argument in their favor. If theoretical considerations dictate that we minimize classification error, just substituting a different loss, and then refusing to look at the classification rate, is highly questionable. Sure, that can lead to a convex optimization problem, but that’s because a different problem is being solved! The goal is accurate prediction of future data, not accurate minimization of a loss on the training data. If we use our best convex approximation to initialize the heuristic optimization of the loss we really want, we will never do worse.

EXPERIMENT

There are some heuristics available for doing this (Reference 3), but they seem a little expensive. I decided to try something really simple.

1. Fit a classifier (by hinge loss or logistic regression).
2. Throw out a few of the most badly misclassified points.
3. Repeat until all remaining points are correctly classified.

The idea is that, by “giving up” on badly misclassified points, the boundary might be movable into a position where it correctly classifies other points. There is no guarantee in general that this will find a linear classifier with a lower misclassification rate, but it should not do worse.
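The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the fitting routine is a crude fixed-iteration subgradient descent on the unregularized hinge loss, and the function names, iteration count, step size and toy data are all made up. “Most badly misclassified” is taken to mean the misclassified points with the most negative ${\hat y} \cdot {\bf w}^T {\hat {\bf x}}$.

```python
def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def fit_hinge(data, n_features, iters=500, step=0.1):
    """Crude subgradient descent on the summed hinge loss (no regularization)."""
    w = [0.0] * n_features
    for t in range(1, iters + 1):
        grad = [0.0] * n_features
        for x, y in data:
            if y * score(w, x) < 1:  # point currently contributes to the loss
                for j, xj in enumerate(x):
                    grad[j] -= y * xj
        rate = step / (len(data) * t ** 0.5)  # decaying step size
        w = [wj - rate * gj for wj, gj in zip(w, grad)]
    return w

def misclassified(w, data):
    """Points whose sign prediction differs from the label."""
    return [(x, y) for x, y in data if (1 if score(w, x) > 0 else -1) != y]

def discard_and_refit(data, n_features, drop=1):
    """Fit, drop the `drop` worst misclassified points, repeat until none remain."""
    remaining = list(data)
    w = [0.0] * n_features
    while remaining:
        w = fit_hinge(remaining, n_features)
        wrong = misclassified(w, remaining)
        if not wrong:
            break
        wrong.sort(key=lambda p: p[1] * score(w, p[0]))  # worst first
        for p in wrong[:drop]:
            remaining.remove(p)
    return w, remaining

# Toy 1-D data (trailing 1 is the bias input), with one outlying positive.
data = [([-4.0, 1.0], 1), ([-3.0, 1.0], 1), ([-2.0, 1.0], 1),
        ([0.0, 1.0], -1), ([1.0, 1.0], -1), ([2.0, 1.0], -1),
        ([3.0, 1.0], -1), ([20.0, 1.0], 1)]
w, kept = discard_and_refit(data, 2, drop=1)
print(w, len(kept), len(misclassified(w, data)))
```

Each round removes at least one point whenever something is still misclassified, so the loop always terminates. The `drop` argument corresponds to how many points are thrown out per round.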

To test this out, I got some data from the MNIST handwritten digit database. To make the problem harder, I took one class to be random images from either the set of 1’s or 2’s, while the other class was 3’s or 4’s. The images were downsampled to 7×7, and a constant of one added. So we are looking for a linear classifier in a 50-dimensional space.

The classifier was trained on 10,000 examples and evaluated on a test set of 10,000 examples.

Here are the results where we throw out the worst 100 training points in step 2:

Here are the results when we throw out 20 at a time:

Here are the results when we throw out 5 at a time:

There seems to be a trade-off in step 2: Fewer models need to be fit if we throw out more points at a time. However, this seems to come with a small penalty in terms of the classification error on the final model.

So, in summary: a drop in classification error on test data from .0941 to .078. That’s a 17% drop. (Or a 21% drop, depending upon which rate you use as a base.) This from a method that you can implement in basically zero extra work if you already have a linear classifier. Seems worth a try.
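As a check of the arithmetic, taking the error rates as 0.0941 before and 0.078 after, the two percentages come from dividing the same difference by the two possible bases:

$\frac{0.0941 - 0.078}{0.0941} = \frac{0.0161}{0.0941} \approx 0.171$

$\frac{0.0941 - 0.078}{0.078} = \frac{0.0161}{0.078} \approx 0.206$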

References:

[2] Efficient Learning of Linear Perceptrons by Shai Ben-david, Hans Ulrich Simo

[3] Optimizing 0/1 Loss for Perceptrons by Random Coordinate Descent by L. Li and H.-T. Lin


### 3 Responses to Linear Classifiers and Loss Functions

1. Mark Reid says:

Interesting post. It’s always good to question the prevailing thinking and try simple experiments like the one you’ve proposed.

One part that is unclear in your experiment is exactly how you trained the linear classifier. Was it logistic regression or an SVM with hinge loss? If it was the latter, was it using a hard or soft margin?

I ask because, at least conceptually, your throwing out of examples on the training set is similar to the use of slack variables to admit misclassifications in soft margin SVMs. With the nu-SVM you can even control the fraction of allowable errors on the training set.

I’d be curious to see whether the same reduction in test set error can be achieved by increasing the nu parameter on a soft-margin SVM.

2. justindomke says:

Thanks for the comment. I couldn’t remember how I fit the models, but looking at it today, it looks like it was the hinge loss. (I wouldn’t call it an SVM, just a straightforward primal minimization of $\sum_{({\hat {\bf x}},{\hat y})} (1-{\hat y} \cdot {\bf w}^T {\hat {\bf x}})_+$).

Throwing out the examples is similar to slack variables, but I don’t think it is quite the same. Slack variables still incur increasing cost the further away the misclassified points are from the decision boundary, while the misclassification error only cares which side of the decision boundary points are on. The picture I am thinking about is something like this, where “+” indicates an example of one class, and “-” one of another. (This is supposed to be a picture of some training data in 1-D)

+++ + - - - - - ............................................ +

Clearly, the way to minimize misclassification error is to stick the decision boundary between the two groups on the left. A convex loss will be worried about the horrible margin for the “+” on the right. (Though it might still put the decision boundary in the right place, depending on the distances, and how many non-outlying points there are, etc.)

I need to look at nu-SVMs in more detail. Thanks for the pointer.