Linear Classifiers and Loss Functions

A linear classifier is probably the simplest machine learning technique. Given some input vector \bf x, predict some output y. One trains a vector of “weights” \bf w that determines the behavior of the classifier. Given some new input \bf x, the prediction f will be:

f({\bf x}) = I[{\bf w}^T {\bf x}>0]

Here, I[\text{expr}] is the indicator function: 1 if \text{expr} is true, and -1 otherwise. So, y is one of two classes.

Often the classifier will be written as I[{\bf w}^T {\bf x}+b>0], i.e. including a “bias” term b. Here I’ll ignore this: if you want a bias term, just append a 1 to every vector \bf x.
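
As a minimal sketch of the above (the function and variable names here are my own, not anything standard), the prediction rule and the bias trick look like this in Python:

    import numpy as np

    def predict(w, x):
        """Linear classifier: return +1 if w.x > 0, and -1 otherwise."""
        return 1 if w @ x > 0 else -1

    # Bias trick: append a constant 1 to every input vector, so the last
    # entry of w plays the role of b.
    x = np.array([0.5, -1.2])
    x_with_bias = np.append(x, 1.0)    # -> array([ 0.5, -1.2,  1. ])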

Now, it seems like nothing could be simpler than a linear classifier, but there is considerable subtlety in how to train one. Given some training data \{({\hat {\bf x}},{\hat y})\}, how do we find the vector of weights \bf w?

There are various ways of doing this. Three popular techniques are:

  • The Perceptron Algorithm
  • Logistic Regression
  • Support Vector Machines (Hinge loss)

The perceptron algorithm works, roughly speaking, as follows: go through the training data, element by element. If a datum is misclassified, slightly nudge the weights so that it is closer to being correct. This algorithm has the remarkable property that, if the training data is linearly separable, it will find a \bf w that correctly classifies all of it in a finite number of iterations. (The number of iterations it takes depends on how separable the data is: it makes fewer mistakes on data with a larger margin.)
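
For concreteness, here is a sketch of the textbook perceptron update (the epoch cap and the names are my own additions):

    import numpy as np

    def perceptron(X, y, max_epochs=100):
        """X: (n, d) array of inputs, y: array of +1/-1 labels."""
        w = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            mistakes = 0
            for x_i, y_i in zip(X, y):
                if y_i * (w @ x_i) <= 0:   # misclassified (or on the boundary)
                    w += y_i * x_i         # nudge w toward classifying x_i correctly
                    mistakes += 1
            if mistakes == 0:              # separable data: done in finitely many passes
                break
        return w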

Unfortunately, the assumption that the training data is linearly separable is basically never true in practice. There is still a bound in terms of how much the points would need to be moved in order to become linearly separable, but it isn’t as strong as we might like. (I don’t know what the folklore is for how this tends to work in practice.)

Logistic regression and support vector machines take a different approach. There, a “loss function” is defined in terms of the weights, and one just optimizes the loss to find the best weights. (I’m ignoring regularization here for simplicity.) For logistic regression, the loss is

L = -\sum_{({\hat {\bf x}},{\hat y})} \log p({\hat y}|{\hat {\bf x}})

where p(y=1|{\bf x})=\sigma({\bf w}^T {\bf x}), for a sigmoid function \sigma, and p(y=-1|{\bf x})=1-p(y=1|{\bf x}).
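
Since 1-\sigma(a)=\sigma(-a), each term of the sum reduces to \log(1+\exp(-{\hat y}\,{\bf w}^T {\hat {\bf x}})), which is how one would typically compute it. A sketch (my own naming, no regularization):

    import numpy as np

    def logistic_loss(w, X, y):
        """Negative log-likelihood: sum of log(1 + exp(-y * w.x)) over the data."""
        margins = y * (X @ w)
        return np.sum(np.log1p(np.exp(-margins)))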

For a support vector machine, the loss is

L = \sum_{({\hat {\bf x}},{\hat y})} (1-{\hat y} \cdot {\bf w}^T {\hat {\bf x}})_+

where (a)_+ equals a if a>0 and 0 otherwise. This is the hinge loss. Notice that it will be zero if {\hat y} \cdot {\bf w}^T {\hat {\bf x}} > 1, i.e. if that particular training element is comfortably on the correct side of the decision boundary. Otherwise, the “pain” is proportional to “how wrong” the classification is.
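
The corresponding computation for the hinge loss, in the same sketchy style:

    import numpy as np

    def hinge_loss(w, X, y):
        """Sum of (1 - y * w.x)_+ over the data; zero for points with margin > 1."""
        margins = y * (X @ w)
        return np.sum(np.maximum(0.0, 1.0 - margins))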

Both the hinge loss and the logistic regression loss are convex functions of the weights. This means that if we apply a nonlinear search procedure to find a local minimum, it will also be the global minimum.

These different loss functions are compared here.

WHY NOT DO THE OBVIOUS THING?

Now, the critical point about the three methods above is that none of them does the seemingly obvious thing: find the vector of weights {\bf w} that has the lowest classification error on the training data. Why not? Some would defend logistic regression or SVMs on theoretical grounds (namely a meaningful probabilistic interpretation, and the reasonableness and theoretical guarantees of max-margin methods, respectively).

However, probably the more significant hurdle is computational. Namely, the problem of finding the weights with the lowest classification error is (in order of increasing horribleness) non-convex, non-differentiable, and NP-hard.
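
For comparison, the “obvious” objective is just a count of mistakes. Written out (again, my own naming), it is piecewise constant in {\bf w}, which is why convex and gradient-based machinery gets no traction on it:

    import numpy as np

    def zero_one_loss(w, X, y):
        """Number of training mistakes: piecewise constant in w,
        hence non-convex and non-differentiable."""
        return np.sum(y * (X @ w) <= 0)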

In fact it is NP-hard even to find an approximate solution with worst-case guarantees. (See reference [2] below, which, interestingly, gives an approximation algorithm in terms of a property of the data.)

Nevertheless, if classification error is what we want, I don’t see how it makes sense to minimize some alternative loss function. As such, I decided today to try the following algorithm.

  1. Apply an SVM to get an initial solution.
  2. Apply heuristic search to minimize classification error, initialized to the solution of step 1.

Now, I am well aware of the problems with non-convex optimization. However, the mere fact that logistic regression or the hinge loss is convex is not an argument in their favor. If theoretical considerations dictate that we minimize classification error, then substituting a different loss and refusing to look at the classification rate is highly questionable. Sure, that leads to a convex optimization problem, but that’s because a different problem is being solved! The goal is accurate prediction of future data, not accurate minimization of a loss on the training data. If we use our best convex approximation to initialize a heuristic optimization of the loss we really want, we will never do worse.

EXPERIMENT

There are some heuristics available for doing this (reference [3]), but they seem a little expensive. I decided to try something really simple.

  1. Fit a classifier (by hinge loss or logistic regression).
  2. Throw out a few of the most badly misclassified points.
  3. Repeat until all remaining points are correctly classified.

The idea is that, by “giving up” on badly misclassified points, the boundary might be moved into a position where it correctly classifies other points. There is no guarantee in general that this will find a linear classifier with a lower misclassification rate, but it should not do worse.
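
Here is a rough sketch of how that loop might be implemented. The use of scikit-learn’s LinearSVC as the base classifier and the drop_per_round parameter are my assumptions for illustration, not details specified above; drop_per_round corresponds to the 100 / 20 / 5 settings in the results below.

    import numpy as np
    from sklearn.svm import LinearSVC

    def prune_and_refit(X, y, drop_per_round=20):
        """Fit a linear classifier, throw out the most badly misclassified
        points, and repeat until every remaining point is classified correctly."""
        keep = np.ones(len(y), dtype=bool)
        while True:
            clf = LinearSVC().fit(X[keep], y[keep])
            margins = y[keep] * clf.decision_function(X[keep])
            if np.all(margins > 0):                # nothing left misclassified
                return clf
            order = np.argsort(margins)            # most badly misclassified first
            worst = order[:drop_per_round]
            worst = worst[margins[worst] <= 0]     # only give up on points that are wrong
            keep[np.flatnonzero(keep)[worst]] = False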

To test this out, I got some data from the MNIST handwritten digit database. To make the problem harder, I took one class to be random images from either the set of 1’s or 2’s, while the other class was 3’s or 4’s. The images were downsampled to 7×7, and a constant of one was appended. So we are looking for a linear classifier in a 50-dimensional space.
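
One way to do this preprocessing (block-averaging the 28×28 MNIST images down to 7×7 is my assumption; the exact scheme isn’t critical) is:

    import numpy as np

    def preprocess(images):
        """images: (n, 28, 28) array. Average 4x4 blocks to get 7x7, flatten,
        and append a constant 1, giving 50 features per image."""
        n = images.shape[0]
        small = images.reshape(n, 7, 4, 7, 4).mean(axis=(2, 4))
        return np.hstack([small.reshape(n, 49), np.ones((n, 1))])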

The classifier was trained on 10,000 examples, with a test set of another 10,000 examples.

Here are the results where we throw out the worst 100 training points in step 2:

Here are the results when we throw out 20 at a time:

Here are the results when we throw out 5 at a time:

There seems to be a trade-off in step 2: fewer models need to be fit if we throw out more points at a time. However, this seems to come with a small penalty in the classification error of the final model.

So, in summary: a drop in classification error on test data from .0941 to .078. That’s a 17% drop. (Or a 21% drop, depending upon which rate you use as the base.) This from a method that you can implement with basically zero extra work if you already have a linear classifier. Seems worth a try.

References:

[1] The Apparent Tradeoff between Computational Complexity and Generalization of Learning: A Biased Survey of our Current Knowledge by Shai Ben-David

[2] Efficient Learning of Linear Perceptrons by Shai Ben-David and Hans Ulrich Simon

[3] Optimizing 0/1 Loss for Perceptrons by Random Coordinate Descent by L. Li and H.-T. Lin

An appalling and politically impossible way to improve voting outcomes.

In this post, I propose that elections in large democracies would give better outcomes if the voting rules were changed as follows:

ONLY A SMALL RANDOM SUBSET OF THE POPULATION GETS A VOTE.

For each election, only a random subset of citizens, notified far in advance, can vote. Nothing else is changed.

Why would this help? First, we need to discuss the problem with voting in large democracies.

THE ECONOMIST’S CRITICISM OF LARGE DEMOCRACIES

Why should I spend the effort to become informed before voting? My vote has an extraordinarily low probability of changing the winner of a national election. Suppose that one candidate is much better for the country. Numerically, imagine that candidate A would be $100,000 “better” per citizen. The probability that my vote changes the election is very small: by one estimate, voters in New Mexico in 2004 had approximately a 1 in 10,000,000 chance of their vote determining the election. All other states had much smaller probabilities.

Thus, if I make a large investment of time to carefully evaluate the candidates’ positions, decide who is better for the country, and then vote, I will receive an expected return of $0.01!

Of course, the issue with the above is that I am only comparing the costs and benefits to me. If one candidate is really $100,000 better per citizen, the total expected benefit of informing myself should be multiplied by the number of citizens (~300 million). In this case, the total expected benefit of figuring out which candidate is better and then voting is $3 million! (This number is surprisingly high because New Mexicans had the highest chance of influencing the election in 2004.) The problem is that almost none of that $3 million goes to the person who put in the work. In economist lingo, educating oneself before voting has large “positive externalities”.
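
Spelled out, the arithmetic behind these two figures is:

\$100{,}000 \times \tfrac{1}{10{,}000{,}000} = \$0.01 \quad \text{(expected benefit to me)}

\$100{,}000 \times 300{,}000{,}000 \times \tfrac{1}{10{,}000{,}000} = \$3{,}000{,}000 \quad \text{(expected benefit to everyone)}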

I think the central problem with large democracies is that voters are uninformed. Not because they are stupid or uneducated (1), but because they are rational.

(1) I strongly reject the idea that people are stupid.

THE COMMON-SENSE REBUTTAL OF THE ECONOMIST’S CRITICISM

The above assumes, of course, that people are strictly selfish beings only seeking to maximize their own utility. Most standard economics is built on this assumption, but in this case, it doesn’t look so good. The rebuttal is:

If the above were true, wouldn’t society descend into anarchy?

I find this rebuttal pretty compelling. Democracy does work fairly well, much better than the above argument would suggest. There are two explanations for this:

  1. Entertainment
  2. Altruism

First, people may choose to learn about candidates simply because they find it fun. Second, people may choose to learn about candidates because they feel a sense of responsibility. They do care about how their vote affects other people, and act accordingly. That is altruism.

Still, I think we can’t dismiss the economist. Why? Ask yourself this:

Suppose you had the only vote. You get to choose, yourself, who will be the next US president. Would you learn more about the candidates than you have bothered to learn now?

I think any honest person would answer, “yes, a great deal more.”

THE ARGUMENT FOR A RANDOM SUBSET

The argument for leaving the decision up to a small number of people is clear. Those people will have a greater stake in the election, so they will take their choices more seriously. Would this really happen? I think so.

IOWA AND NEW HAMPSHIRE

If you are not familiar, the US political parties choose their nominees through a staggered process. Different states do not vote simultaneously. Iowa and New Hampshire vote first, and thus wield huge influence on which candidate is chosen. As a result, the people in those states have developed a culture that takes the primaries very seriously.  It is fair to say that the average Iowa primary voter is better informed than the average voter in a state that votes on Super Tuesday.

INFLUENCE OF MONEY

Of course, the problem is that it is totally unfair that people in two states have so large a say in who becomes president! The obvious solution would be to move to a simultaneous, national primary. Ignoring the fact that we would have to pry the primary from New Hampshire’s cold, dead hands, would we want this?

Candidates with little money can compete and win in Iowa and New Hampshire through “retail politics”, launching them to a national campaign. With a national primary, it would be even more difficult for a poorly funded candidate to win.

10,000 PEOPLE

Now, imagine that votes were given to a random subset of the population, say 10,000 people. What would happen?

  • The candidates would personally meet each voter. Those voters would demand, and get, meetings with the candidates in small groups. The candidates would be forced to actually answer the questions asked of them. Over a year, to meet each voter, a candidate would need to talk to only 27.39 voters per day.
  • Each voter would both have a personal interest in the election and feel a strong sense of personal responsibility for the outcome. With only 10,000 voters, every vote counts. If you have a real chance of changing the winner, you have a real interest in making your choice correctly.
  • The influence of money would be reduced. Initially, I thought it might be reduced almost to zero, but upon reflection this probably wouldn’t happen. The reason is that the voters would be subject to intense lobbying from their friends and neighbors. Thus, candidates might still choose to run TV ads in the hopes of indirect influence on voters.
  • Unfair negative attacks would lose impact, while fair negative attacks would gain impact. Think of how negative attacks work now: suppose lying candidate A runs a 30-second TV ad falsely accusing honest candidate B of shooting penguins for fun. B then runs a 30-second ad pointing out that this is nonsense. I, the disengaged voter, am left with a cloudy doubt about the truth: perhaps B likes to kill penguins, perhaps not. I am also left more cynical about politics, leading to my further disengagement. Under my proposal, imagine that A makes the same attack. I then meet with B, who personally explains their deep love for the penguin species. B’s staff sends me information on all the pro-penguin legislation B has supported. I am now left only with a distaste for A’s dishonesty.

CONCLUSION

The biggest problems with this suggestion are:

  1. It is undemocratic.

  2. It will never happen.

“Disenfranchise 99.9% of the population? Are you insane?” Well, maybe. However, I believe that the trouble with democracy is that people already feel, and are, disenfranchised. You are a fool to spend hours and days educating yourself when your vote will just be a drop in a sea of uninformed votes.

10,000 highly enfranchised voters would do a better job than 200,000,000 distracted ones.

Most discussion of voting systems focuses on game-theoretic issues. Assuming the whole population knows which candidate they want, and voters vote strategically, how can the system be structured to pick the candidate with the highest appeal? There are mathematical results showing that, with more than two candidates, no voting system satisfies certain intuitive criteria.

I doubt the importance of this. That model ignores the large cost to a voter of choosing to inform themselves about the candidates. I think instead, voting systems should be designed to give voters an incentive to pay that cost.

Postscript: The ancient Greeks apparently often filled offices through sortition: randomly assigning the office to a citizen. My proposal here could be seen as these two steps:

  1. Expanding the electoral college to 10,000 members.

  2. Choosing those members by sortition.