# Backpropagation

A neural network can be described as a series of alternating linear transformations and elementwise nonlinearities.

${\bf s}_n=W_n{\bf x}_{n-1}$

${\bf x}_n=\sigma({\bf s}_n)$

Here $\sigma$ is a “sigmoid function” applied elementwise, usually $1/(1+\exp(-s))$ or $\tanh(s)$.
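As a concrete sketch of the forward recursion (hypothetical NumPy code; the choice $\sigma=\tanh$ and the function name are illustrative, and the intermediate ${\bf s}_n$ and ${\bf x}_n$ are cached because backpropagation will need them):

```python
import numpy as np

def forward(weights, x0):
    """Forward pass: s_n = W_n x_{n-1}, x_n = sigma(s_n).

    Returns the cached pre-activations ss = [s_1, ..., s_N] and
    activations xs = [x_0, ..., x_N] for use in backpropagation.
    """
    ss, xs = [], [x0]
    for W in weights:
        s = W @ xs[-1]        # s_n = W_n x_{n-1}
        ss.append(s)
        xs.append(np.tanh(s)) # x_n = sigma(s_n), here sigma = tanh
    return ss, xs
```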

You present input ${\bf x}_0$ and get output ${\bf x}_N$.  You have some loss function $L$ that says how much you like that particular output on that input.  Backpropagation is an algorithm for calculating the derivatives of $L$ with respect to all the weight matrices $W_n$.  Backprop is a special case of reverse-mode automatic differentiation.  However, it is simple enough to implement manually that it is often worth doing so for efficiency.

The loss function will directly give the derivatives with respect to the output.  This will be either (depending on the application)

$\displaystyle{\frac{dL}{d{\bf x}_N}}$ or $\displaystyle{\frac{dL}{d{\bf s}_N}}.$
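For example (these loss functions are common choices, assumed here for illustration): the squared-error loss $L=\frac12\|{\bf x}_N-{\bf y}\|^2$ gives the first form directly, $\frac{dL}{d{\bf x}_N}={\bf x}_N-{\bf y}$, while a softmax output with cross-entropy loss gives the second form, $\frac{dL}{d{\bf s}_N}={\bf p}-{\bf y}$ with ${\bf p}=\mathrm{softmax}({\bf s}_N)$.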

The algorithm follows easily from applying the following three rules (here, $\odot$ denotes the elementwise product):

$\displaystyle{\frac{dL}{d{\bf s}_n} = \frac{dL}{d{\bf x}_n} \odot \sigma'({\bf s}_n)}$

$\displaystyle{\frac{dL}{dW_n} = \frac{dL}{d{\bf s}_n}{{\bf x}_{n-1}}^T}$

$\displaystyle{\frac{dL}{d{\bf x}_{n-1}} = {W_n}^T\frac{dL}{d{\bf s}_n}}$
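A minimal sketch of the backward pass built from these three rules, assuming $\sigma=\tanh$ (so $\sigma'(s)=1-\tanh^2(s)$) and assuming the lists `ss` $=[{\bf s}_1,\dots,{\bf s}_N]$ and `xs` $=[{\bf x}_0,\dots,{\bf x}_N]$ were cached during the forward pass; the NumPy code and names are illustrative:

```python
import numpy as np

def backward(weights, ss, xs, dL_dxN):
    """Given cached ss = [s_1..s_N], xs = [x_0..x_N] and dL/dx_N,
    return [dL/dW_1, ..., dL/dW_N]."""
    grads = [None] * len(weights)
    dL_dx = dL_dxN
    for n in range(len(weights) - 1, -1, -1):
        # Rule 1: dL/ds_n = dL/dx_n * sigma'(s_n), with sigma = tanh
        dL_ds = dL_dx * (1.0 - np.tanh(ss[n]) ** 2)
        # Rule 2: dL/dW_n = dL/ds_n x_{n-1}^T  (xs[n] is x_{n-1} here)
        grads[n] = np.outer(dL_ds, xs[n])
        # Rule 3: dL/dx_{n-1} = W_n^T dL/ds_n
        dL_dx = weights[n].T @ dL_ds
    return grads
```

Each step is a matrix-vector product or an outer product of the same sizes as in the forward pass, which is where the matching complexity comes from.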

Notice that applying these rules in reverse, from layer $N$ down to layer $1$, has the same complexity as the original “forward propagation”.