Logistic Regression
Note: These notes are about multi-class logistic regression, where we do not assume $y$ is binary. The great majority of information out there on “logistic regression” assumes $y$ is binary, leading to a slightly simpler formalism.
INTRODUCTION
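Logistic regression is one of the simplest classification methods. Given some real input vector $\mathbf{x}$, logistic regression produces a distribution over a discrete output $y$ as
$$p(y \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_y \cdot \mathbf{x})}{\sum_{y'} \exp(\mathbf{w}_{y'} \cdot \mathbf{x})}.$$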
This seems a bit mysterious at first, but can be understood quite easily. The rough intuition of logistic regression is that we will create a vector of weights $\mathbf{w}_y$ for each possible class $y$. (For example, binary classification would have $\mathbf{w}_1$ and $\mathbf{w}_2$.) Roughly speaking, we want to set these weights so that if $\mathbf{w}_y \cdot \mathbf{x}$ is large (larger than $\mathbf{w}_{y'} \cdot \mathbf{x}$, $y' \neq y$), then $y$ is given high probability.
Notice that, in order to be a valid probability distribution, $p(y \mid \mathbf{x})$ must satisfy two conditions:
1) positivity: $p(y \mid \mathbf{x}) \geq 0$ for all $y$
2) normalization: $\sum_y p(y \mid \mathbf{x}) = 1$
Logistic regression basically takes the raw “scores” $\mathbf{w}_y \cdot \mathbf{x}$ and turns them into a valid probability distribution in the simplest possible way. The $\exp$ function ensures that the scores are positive, while the normalization constant $\sum_{y'} \exp(\mathbf{w}_{y'} \cdot \mathbf{x})$ ensures that the total probabilities sum to one.
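The score-to-probability step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `class_probabilities`, the list-of-lists layout for the weights, and the example numbers are all made up here, and subtracting the maximum score is a standard numerical-stability trick that the notes themselves do not mention (it cancels in the ratio, so the probabilities are unchanged).

```python
import math

def class_probabilities(W, x):
    """Return p(y | x) for every class y.

    W is a list with one weight vector per class (hypothetical layout);
    x is the input vector.
    """
    # Raw score for each class: the dot product w_y . x
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in W]
    # Shifting all scores by their maximum avoids overflow in exp and
    # leaves the final ratio unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp makes every term positive
    Z = sum(exps)                             # normalization constant
    return [e / Z for e in exps]

# Illustrative numbers only: three classes, two features.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [2.0, 0.5]
probs = class_probabilities(W, x)
print(probs)
```

The class with the largest score (here the first one, with score 2.0) receives the largest probability, which matches the intuition given above.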
TRAINING
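Suppose we have some training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$. How can we set the weights to fit well to the data? The traditional way would be maximum conditional likelihood learning. That is, one sets the weights to maximize the quantity
$$L(W) = \sum_{i=1}^N \log p(y_i \mid \mathbf{x}_i) = \sum_{i=1}^N \Big( \mathbf{w}_{y_i} \cdot \mathbf{x}_i - \log \sum_{y'} \exp(\mathbf{w}_{y'} \cdot \mathbf{x}_i) \Big),$$
where $W$ collects the weight vectors of all classes.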
Notice, first of all, that this is a concave maximization: each term of the objective is a linear function of the weights minus a “log-sum-exp” of linear functions, and log-sum-exp is a convex function. This means, of course, that a local optimization will find the globally optimal solution.
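As a rough illustration, the conditional log-likelihood can be evaluated as below, assuming each training example contributes its correct-class score minus the log-sum-exp of all class scores. The helper names, the toy data, and the two weight settings are hypothetical. The last lines spot-check concavity at a single midpoint, which is a sanity check and not a proof.

```python
import math

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def log_sum_exp(scores):
    # Stable evaluation of log(sum(exp(s))) by factoring out the maximum.
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def log_likelihood(W, data):
    """Sum over examples of: score of the true class minus log-sum-exp."""
    total = 0.0
    for x, y in data:
        scores = [dot(w, x) for w in W]
        total += scores[y] - log_sum_exp(scores)
    return total

# Toy data (made up): pairs of (input vector, class index).
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 1)]

# Two arbitrary weight settings and their midpoint.
W_a = [[0.5, -0.2], [-0.3, 0.8]]
W_b = [[-1.0, 2.0], [1.5, 0.1]]
W_mid = [[(a + b) / 2 for a, b in zip(wa, wb)] for wa, wb in zip(W_a, W_b)]

# For a concave function, the value at the midpoint is at least the
# average of the values at the endpoints.
average = 0.5 * (log_likelihood(W_a, data) + log_likelihood(W_b, data))
print(log_likelihood(W_mid, data) >= average)
```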
Now, how should we actually set the weights? There have been many methods developed. However, experiments seem to suggest that there are three main algorithms to consider:
1) Quasi-Newton Methods (BFGS, L-BFGS, DFP)
2) Nonlinear Conjugate Gradient
3) Stochastic Gradient Descent (or a sophisticated variant)
(These notes by Minka compare the first two to other methods. The third will probably be more competitive on datasets with extremely large numbers of data elements.)
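All three of these methods simply require function evaluations plus the gradient of the conditional log-likelihood. This can be shown to be equal to
$$\frac{\partial L}{\partial \mathbf{w}_y} = \sum_{i=1}^N \big( I[y_i = y] - p(y \mid \mathbf{x}_i) \big)\, \mathbf{x}_i,$$
where $L$ is the conditional log-likelihood over the training pairs $(\mathbf{x}_i, y_i)$ and $I[\cdot]$ is the indicator function.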
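A minimal sketch of the gradient computation follows, assuming the gradient with respect to each class's weight vector is the sum over examples of (indicator of the observed class minus predicted probability) times the input. All function names and the toy data are hypothetical. The optimizer shown is plain fixed-step gradient ascent, used here only as a simple stand-in; it is not one of the three methods listed above, and the step size and iteration count are arbitrary choices.

```python
import math

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def probabilities(W, x):
    """p(y | x) for every class y (exponentiated scores, normalized)."""
    scores = [dot(w, x) for w in W]
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

def log_likelihood(W, data):
    return sum(math.log(probabilities(W, x)[y]) for x, y in data)

def gradient(W, data):
    """Gradient of the log-likelihood: one vector per class."""
    grad = [[0.0] * len(w) for w in W]
    for x, y in data:
        p = probabilities(W, x)
        for k in range(len(W)):
            # Indicator that k is the observed class, minus p(k | x).
            coeff = (1.0 if k == y else 0.0) - p[k]
            for j, x_j in enumerate(x):
                grad[k][j] += coeff * x_j
    return grad

def gradient_ascent(W, data, step=0.1, iters=200):
    """Fixed-step gradient ascent (a stand-in, not BFGS, CG, or SGD)."""
    for _ in range(iters):
        g = gradient(W, data)
        W = [[w + step * d for w, d in zip(w_k, g_k)]
             for w_k, g_k in zip(W, g)]
    return W

# Toy data (made up): pairs of (input vector, class index).
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 1)]
W0 = [[0.0, 0.0], [0.0, 0.0]]
W_fit = gradient_ascent(W0, data)
print(log_likelihood(W0, data), log_likelihood(W_fit, data))
```

Because the gradient has this simple closed form, it is easy to validate an implementation by comparing it against finite differences of the log-likelihood before handing it to a quasi-Newton or conjugate gradient routine.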
REGULARIZATION
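In practice, of course, one usually adds a regularization term to the optimization. The two most common are “Ridge regression” and “the lasso”. Ridge regression would optimize
$$\max_W \; \sum_i \log p(y_i \mid \mathbf{x}_i) - \lambda \sum_y \|\mathbf{w}_y\|_2^2,$$
while the lasso would optimize
$$\max_W \; \sum_i \log p(y_i \mid \mathbf{x}_i) - \lambda \sum_y \|\mathbf{w}_y\|_1,$$
where $\lambda \geq 0$ sets the regularization strength.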
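The two penalties can be sketched as follows, assuming the ridge term is a constant (called `lam` here) times the sum of squared weights and the lasso term is the same constant times the sum of absolute weights, each subtracted from the log-likelihood being maximized. The function names and the example weights are hypothetical. Note that the lasso penalty is not differentiable at zero, so the second helper returns one valid subgradient (taking 0 at a zero weight); gradient-based methods generally need special handling for this case.

```python
def ridge_penalty(W, lam):
    """lam times the sum of squared weights over all classes."""
    return lam * sum(w * w for w_k in W for w in w_k)

def lasso_penalty(W, lam):
    """lam times the sum of absolute weights over all classes."""
    return lam * sum(abs(w) for w_k in W for w in w_k)

def ridge_penalty_gradient(W, lam):
    # Derivative of lam * w^2 is 2 * lam * w.
    return [[2 * lam * w for w in w_k] for w_k in W]

def lasso_penalty_subgradient(W, lam):
    # lam * sign(w), choosing 0 at w == 0 (one valid subgradient).
    return [[lam * ((w > 0) - (w < 0)) for w in w_k] for w_k in W]

def regularized_objective(loglik, W, lam, penalty):
    """Quantity to maximize: log-likelihood minus the chosen penalty."""
    return loglik - penalty(W, lam)

# Illustrative weights and a placeholder log-likelihood value.
W = [[1.0, -2.0], [0.0, 3.0]]
lam = 0.5
print(regularized_objective(-4.0, W, lam, ridge_penalty))
print(regularized_objective(-4.0, W, lam, lasso_penalty))
```

When maximizing with a gradient-based method, the penalty gradient is subtracted from the log-likelihood gradient.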