# What Gauss-Seidel is Really Doing

I’ve been reading Alan Sokal’s lecture notes “Monte Carlo Methods in Statistical Mechanics: Foundations and New Algorithms” today. Once I learned to take the word “Hamiltonian” and mentally substitute “function to be minimized”, they are very clearly written.

Anyway, the notes give an explanation of the Gauss-Seidel iterative method for solving linear systems that is so clear, I feel a little cheated that it was never explained to me before. Let’s start with the typical (confusing!) explanation of Gauss-Seidel (as in, say, Wikipedia). You want to solve some linear system

$A{\bf x}={\bf b}$,

for $\bf x$. What you do is decompose $A$ into a diagonal matrix $D$, a strictly lower triangular matrix $L$, and a strictly upper triangular matrix $U$, like

$A=L+D+U$.

To solve this, you iterate like so:

${\bf x} \leftarrow (D+L)^{-1}({\bf b}-U{\bf x})$.

This works when $A$ is symmetric positive definite.
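
Just to make this concrete (this is my own illustration, not from Sokal's notes), here is a minimal NumPy sketch of that iteration; the function name and the fixed iteration count are arbitrary choices:

```python
import numpy as np

def gauss_seidel_matrix(A, b, x0=None, n_iters=100):
    """Textbook Gauss-Seidel: repeatedly apply x <- (D+L)^{-1} (b - U x)."""
    x = np.zeros_like(b, dtype=float) if x0 is None else np.asarray(x0, dtype=float).copy()
    D_plus_L = np.tril(A)       # D + L: lower triangle of A, including the diagonal
    U = np.triu(A, k=1)         # strictly upper triangle of A
    for _ in range(n_iters):
        # (D + L) is triangular, so in principle this solve is just forward substitution.
        x = np.linalg.solve(D_plus_L, b - U @ x)
    return x

# A small symmetric positive definite example.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(gauss_seidel_matrix(A, b))    # should be close to np.linalg.solve(A, b)
```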

Now, what the hell is going on here? The first observation is that instead of inverting $A$, you can think of minimizing the function

$\frac{1}{2} {\bf x}^T A {\bf x} - {\bf x}^T{\bf b}$.

(It is easy to see this: the gradient is $A{\bf x}-{\bf b}$, which vanishes exactly when $A{\bf x}={\bf b}$, and positive definiteness makes that stationary point the global minimum.) Now, how to minimize this function? One natural approach would be to just iterate through the coordinates of ${\bf x}$, minimizing the function over each one in turn. Say we want to minimize with respect to $x_i$. So, we are going to make the change $x_i \leftarrow x_i + d$, i.e. ${\bf x} \leftarrow {\bf x} + d\,{\bf e_i}$, where ${\bf e_i}$ is the unit vector in the $i$th direction. Thus, we want to minimize, with respect to $d$, this thing:

$\frac{1}{2} ({\bf x}+d {\bf e_i})^T A ({\bf x}+d {\bf e_i}) - ({\bf x}+d {\bf e_i})^T {\bf b}$

Expanding this, taking the derivative with respect to $d$, and solving gives us

$d = \frac{{\bf e_i}^T({\bf b}-A{\bf x})}{{\bf e_i}^TA{\bf e_i}}$,

or, equivalently,

$d = \frac{1}{A_{ii}}(b_i-A_{i*}{\bf x})$, where $A_{i*}$ denotes the $i$th row of $A$.
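
In code, one full sweep of this coordinate descent looks like the following (again my own sketch, with made-up names); notice that each update uses the current contents of ${\bf x}$, including coordinates already updated earlier in the same sweep:

```python
import numpy as np

def coordinate_descent_sweep(A, b, x):
    """One pass of exact coordinate minimization of 0.5 x^T A x - x^T b."""
    x = np.asarray(x, dtype=float).copy()
    for i in range(len(b)):
        d = (b[i] - A[i, :] @ x) / A[i, i]   # optimal step along e_i, as derived above
        x[i] += d
    return x
```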

So, taking the solution at one timestep, ${\bf x}^t$, and solving for the solution ${\bf x}^{t+1}$ at the next timestep, we will have, for the first index,

$x^{t+1}_1 = x^t_1 + \frac{1}{A_{11}}(b_1-A_{1*}{\bf x}^t)$.

Meanwhile, for the second index,

$x^{t+1}_2 = x^t_2 + \frac{1}{A_{22}}(b_2-A_{2*}({\bf x}^t-{\bf e_1} x^t_1 +{\bf e_1} x^{t+1}_1 ))$.

And, more generally

$x^{t+1}_n = x^t_n + \frac{1}{A_{nn}}\left(b_n-A_{n*}\left(\sum_{i<n} {\bf e_i}\, x^{t+1}_i + \sum_{i \geq n} {\bf e_i}\, x^t_i\right)\right)$.

Now it's all just a matter of cranking through the equations.

$x^{t+1}_n = x^t_n + \frac{1}{A_{nn}}\left(b_n-\sum_{i<n} A_{ni}\, x^{t+1}_i - \sum_{i \geq n} A_{ni}\, x^t_i\right)$

Multiplying each of these $n$ equations through by $A_{nn}$ and stacking them into matrix form gives

$D {\bf x}^{t+1} = D {\bf x}^t + {\bf b}- L {\bf x}^{t+1} - (D+U) {\bf x}^t$,

or, after rearranging,

$(D+L) {\bf x}^{t+1} = {\bf b}-U {\bf x}^t$,

which is exactly the iteration we started with.
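
If you prefer a numerical confirmation to the algebra, here is a quick check (my own sketch, assuming the `coordinate_descent_sweep` function from above) that one coordinate-descent sweep coincides with one matrix-form Gauss-Seidel step:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)     # a random symmetric positive definite matrix
b = rng.standard_normal(5)
x = rng.standard_normal(5)

x_cd = coordinate_descent_sweep(A, b, x)                       # one sweep of coordinate descent
x_gs = np.linalg.solve(np.tril(A), b - np.triu(A, k=1) @ x)    # one step of (D+L)^{-1}(b - U x)
print(np.allclose(x_cd, x_gs))                                 # True
```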

The fact that this works when $A$ is symmetric positive definite is now obvious: that is when the objective function we started with is bounded below.

Understanding that Gauss-Seidel is just a convenient way to implement coordinate descent also gives some intuition for its numerical properties: for example, it quickly removes "high frequency" error but only very slowly removes "low frequency" error.
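
To see that smoothing behavior concretely, here is a little experiment of my own, using the standard one-dimensional Poisson matrix as a test problem: with ${\bf b}=0$ the iterate itself is the error, and a high-frequency error mode dies off within a few sweeps while a low-frequency one barely moves.

```python
import numpy as np

def gs_sweep(A, b, x):
    """One in-place Gauss-Seidel / coordinate-descent sweep, as derived above."""
    x = x.copy()
    for i in range(len(b)):
        x[i] += (b[i] - A[i, :] @ x) / A[i, i]
    return x

n = 64
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Poisson matrix (SPD)
b = np.zeros(n)                                          # true solution is 0, so x is the error
j = np.arange(1, n + 1)

for name, k in [("low frequency (k=1) ", 1), ("high frequency (k=48)", 48)]:
    x = np.sin(k * np.pi * j / (n + 1))                  # error consisting of a single sine mode
    for _ in range(20):
        x = gs_sweep(A, b, x)
    print(f"{name}: error norm after 20 sweeps = {np.linalg.norm(x):.2e}")
```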