# Logistic Regression

Note: These notes are about multi-class logistic regression, where we do not assume $y$ is binary.  The great majority of information out there on “logistic regression” assumes $y$ is binary, leading to a slightly simpler formalism.

INTRODUCTION

Logistic regression is one of the simplest classification methods. Given some real input vector ${\bf x}$, logistic regression produces a distribution over a discrete output $y$ as

$p(y|{\bf x})=\frac{1}{Z({\bf x})}\exp\bigl({\bf w}^{y}\cdot{\bf x}\bigr)$

$Z({\bf x})=\sum_{y}\exp({\bf w}^{y}\cdot{\bf x}).$

This seems a bit mysterious at first, but can be understood quite easily. The rough intuition of logistic regression is that we will create a vector of weights ${\bf w}^{y}$ for each possible class $y$. (For example, binary classification would have ${\bf w}^{0}$ and ${\bf w}^{1}$.) Roughly speaking, we want to set these weights so that if ${\bf w}^{y}\cdot{\bf x}$ is large (larger than ${\bf w}^{y'}\cdot{\bf x}$, $y'\not=y$), then $y$ is given high probability.

Notice that, in order to be a valid probability distribution, $p(y|{\bf x})$ must satisfy two conditions:

1) positivity: $p(y|{\bf x})\geq0$

2) normalization: $\sum_{y}p(y|{\bf x})=1$

Logistic regression basically takes the raw {}“scores” ${\bf w}^{y}\cdot{\bf x}$, and tuns them into a valid probability distribution in the simplest possible way. The $\exp$ function ensures that the scores are positive, while the normalization constant $Z({\bf x})$ ensures that the total probabilities sum to one.

TRAINING

Suppose we have some training data $\{(\hat{y},\hat{{\bf x}})\}$. How can we set the weights to fit well to the data? The traditional  way would be maximum conditional likelihood learning. That is, one sets the weights to maximize the quantity

$\sum_{(\hat{y},\hat{{\bf x}})}\log p(\hat{y}|\hat{{\bf x}})=\sum_{(\hat{y},\hat{{\bf x}})}\bigl({\bf w}^{\hat{y}}\cdot\hat{{\bf x}}-\log\sum_{y}\exp({\bf w}^{y}\cdot\hat{{\bf x}})\bigr).$

Notice, first of all, that this is a concave maximization, owing to the fact that {}“log-sum-exp” is a convex function. This means, of course, that a local optimization will find the globally optimal solution.

Now, how should we actually set the weights? There have been many methods developed. However, experiments seem to suggest that there are three main algorithms to consider:

1) Quasi-Newton Methods (BFGS, L-BFGS, DFP)
3) Stochastic Gradient Descent (or a sophisticated variant)

(These notes by Minka compare the first two to other methods. The third will probably be more competitive on datasets with extremely large numbers of data elements.)

All three of these methods simply require function evaluations plus the gradient of the conditional likelihood. This can be shown to be equal to

$\frac{d\log p(\hat{y}|\hat{{\bf x}})}{d{\bf w}^{y}}={\hat {\bf x}}\bigl(I[y=\hat{y}]-p(y|\hat{{\bf x}})\bigr).$

REGULARIZATION

In practice, of course, one usually adds a regularization term to the optimization.  The two most common are “Ridge regression” and “the lasso”.  Ridge regression would optimize

$\sum_{(\hat{y},\hat{{\bf x}})}\log p(\hat{y}|\hat{{\bf x}}) - \lambda \sum_y ||{\bf w}^y||^2,$

while the lasso would optimize

$\sum_{(\hat{y},\hat{{\bf x}})}\log p(\hat{y}|\hat{{\bf x}}) - \lambda \sum_y ||{\bf w}^y||_1.$