Note: These notes are about *multi-class* logistic regression, where we do not assume the output $y$ is binary. The great majority of information out there on “logistic regression” assumes $y$ is binary, leading to a slightly simpler formalism.

**INTRODUCTION**

Logistic regression is one of the simplest classification methods. Given some real input vector $\mathbf{x}$, logistic regression produces a distribution over a discrete output $y$ as

$$p(y|\mathbf{x};\mathbf{w}) = \frac{\exp(\mathbf{w}_y \cdot \mathbf{x})}{\sum_{y'} \exp(\mathbf{w}_{y'} \cdot \mathbf{x})}.$$

This seems a bit mysterious at first, but can be understood quite easily. The rough intuition of logistic regression is that we will create a vector of weights $\mathbf{w}_y$ for each possible class $y$. (For example, binary classification would have two weight vectors, $\mathbf{w}_0$ and $\mathbf{w}_1$.) Roughly speaking, we want to set these weights so that if $\mathbf{w}_y \cdot \mathbf{x}$ is large (larger than $\mathbf{w}_{y'} \cdot \mathbf{x}$ for all $y' \neq y$), then $y$ is given high probability.

Notice that, in order to be a valid probability distribution, $p(y|\mathbf{x})$ must satisfy two conditions:

1) positivity: $p(y|\mathbf{x}) \geq 0$

2) normalization: $\sum_y p(y|\mathbf{x}) = 1$

Logistic regression basically takes the raw “scores” $\mathbf{w}_y \cdot \mathbf{x}$ and turns them into a valid probability distribution in the simplest possible way. The $\exp$ function ensures that the scores are positive, while the normalization constant $\sum_{y'} \exp(\mathbf{w}_{y'} \cdot \mathbf{x})$ ensures that the total probabilities sum to one.
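As a quick illustration, here is a minimal NumPy sketch of this computation (the function name `predict_proba`, the array shapes, and the max-subtraction trick for numerical stability are my own additions, not part of these notes):

```python
import numpy as np

def predict_proba(W, x):
    """Return p(y|x) for every class y, where the rows of W are the
    per-class weight vectors w_y."""
    scores = W @ x              # raw scores w_y . x, one per class
    scores -= scores.max()      # subtract the max for numerical stability
                                # (cancels in the ratio, so p(y|x) is unchanged)
    p = np.exp(scores)          # exp makes every entry positive
    return p / p.sum()          # dividing by the sum makes them sum to one

# Sanity check with random weights: the output is positive and sums to one.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))     # 3 classes, 5 input features (arbitrary sizes)
x = rng.normal(size=5)
p = predict_proba(W, x)
print(p, p.sum())               # all entries > 0; the sum is 1.0
```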

**TRAINING**

Suppose we have some training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$. How can we set the weights to fit the data well? The traditional way is maximum conditional likelihood learning. That is, one sets the weights to maximize the quantity

$$L(\mathbf{w}) = \sum_{i=1}^{N} \log p(y_i|\mathbf{x}_i;\mathbf{w}) = \sum_{i=1}^{N} \Bigl( \mathbf{w}_{y_i} \cdot \mathbf{x}_i - \log \sum_{y} \exp(\mathbf{w}_y \cdot \mathbf{x}_i) \Bigr).$$

Notice, first of all, that this is a concave maximization problem, owing to the fact that “log-sum-exp” is a convex function (so its negation is concave). This means, of course, that local optimization will find the globally optimal solution.
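In code, the quantity being maximized looks like the following (a sketch assuming NumPy/SciPy; the name `log_likelihood` and the array shapes are my own, and in practice one minimizes the negation, since standard optimizers minimize):

```python
import numpy as np
from scipy.special import logsumexp

def log_likelihood(W, X, y):
    """Conditional log-likelihood sum_i log p(y_i | x_i; W).
    W: (num_classes, num_features), X: (num_data, num_features),
    y: (num_data,) integer class labels."""
    scores = X @ W.T                  # scores[i, c] = w_c . x_i
    # log p(y_i|x_i) = (score of the true class) - log sum_y exp(score of y)
    return (scores[np.arange(len(y)), y] - logsumexp(scores, axis=1)).sum()
```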

Now, how should we actually set the weights? Many methods have been developed. However, experiments seem to suggest that there are three main algorithms to consider:

1) Quasi-Newton Methods (BFGS, L-BFGS, DFP)

2) Nonlinear Conjugate Gradient

3) Stochastic Gradient Descent (or a sophisticated variant)

(These notes by Minka compare the first two to other methods. The third will probably be more competitive on datasets with extremely large numbers of data elements.)

All three of these methods simply require function evaluations plus the gradient of the conditional likelihood. This can be shown to be equal to

$$\frac{dL}{d\mathbf{w}_y} = \sum_{i=1}^{N} \mathbf{x}_i \bigl( I(y = y_i) - p(y|\mathbf{x}_i;\mathbf{w}) \bigr),$$

where $I(\cdot)$ is the indicator function.
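Putting the objective and its gradient together, here is a sketch of training with SciPy's L-BFGS implementation (the helper name `fit`, the flattening of the weight matrix, and the zero initialization are my own choices, not prescribed by these notes):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

def fit(X, y, num_classes):
    """Maximum conditional likelihood via L-BFGS (minimizing the negated objective)."""
    n, d = X.shape
    onehot = np.eye(num_classes)[y]              # onehot[i, c] = I(c = y_i)

    def fun_and_grad(w_flat):
        W = w_flat.reshape(num_classes, d)
        scores = X @ W.T                         # scores[i, c] = w_c . x_i
        nll = -(scores[np.arange(n), y] - logsumexp(scores, axis=1)).sum()
        P = softmax(scores, axis=1)              # P[i, c] = p(c | x_i; w)
        # Negated gradient formula: d(-L)/dw_c = sum_i x_i (p(c|x_i) - I(c = y_i))
        grad = (P - onehot).T @ X
        return nll, grad.ravel()

    res = minimize(fun_and_grad, np.zeros(num_classes * d),
                   jac=True, method="L-BFGS-B")
    return res.x.reshape(num_classes, d)
```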

**REGULARIZATION**

In practice, of course, one usually adds a regularization term to the optimization. The two most common are “Ridge regression” (a squared $\ell_2$ penalty) and “the lasso” (an $\ell_1$ penalty). Ridge regression would optimize

$$L(\mathbf{w}) - \lambda \sum_y \|\mathbf{w}_y\|_2^2,$$

while the lasso would optimize

$$L(\mathbf{w}) - \lambda \sum_y \|\mathbf{w}_y\|_1.$$
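For instance, the ridge-regularized objective (negated for a minimizer) only needs one extra term; here $\lambda$ is a regularization strength one would typically choose by cross-validation (again a sketch, with my own naming):

```python
import numpy as np
from scipy.special import logsumexp

def ridge_objective(W, X, y, lam):
    """Negated ridge-regularized objective: -L(w) + lam * sum_y ||w_y||_2^2."""
    scores = X @ W.T
    nll = -(scores[np.arange(len(y)), y] - logsumexp(scores, axis=1)).sum()
    # The lasso variant would instead add lam * np.abs(W).sum(); note that its
    # non-differentiability at zero calls for care with gradient-based solvers.
    return nll + lam * np.sum(W ** 2)
```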
