# The second and third best features of LyX you aren’t using.

LyX is a WYSIWYG editor for latex files. It’s a little bit clunky to use at first, and isn’t perfect (thank you, open source developers– I’m not ungrateful!) but after becoming familiar with it, it’s probably the single piece of software that has most improved my productivity. I like it so much I use it not just for papers but often as a scratchpad for math. Many times during random discussions I’ve used it to quickly bang out some equations and after seeing how fast it was, others immediately switched to using it.

Anyway, while macros may be the best feature you LyX aren’t using, I recently discovered another couple excellent ones I wasn’t familiar with after years of use so I thought I’d publicize. Specifically, I’ve always hated the process of including explanatory figures into LyX. Exporting plots from an experiment is tricky to improve, but when trying to create explanatory graphs, I’ve always hated the process.

Before, I thought the options were.

1) Create the graph in an external program. This is fine, of course, but is quite inconvenient when you want to go back and revise it. The external program usually saves it in some other format, so you have to open the graphic in open it again, revise it, export it to .pdf (or whatever), then open the document in LyX, compile it. Then, when you don’t like the way it looks, you have to repeat the whole process. It works, but it’s not efficient, since you can’t edit the content in place. (Which is the whole point of using a WYSIWYG editor in the first place– remove the need for thinking about anything but content.)

2) Write the graphics directly in LyX in a language like TikZ. This is more “in-place” in that you don’t have external files to find and manipulate. However, I find TikZ to be quite painful to get right with many re-compilations necessary. If the TikZ document is in place each requires a full compilation of the document. This is hilariously slow when making something like Beamer slides. Further, this totally violates the whole point of WYSIWYG since you’re looking at code, rather than the output.

There are better ways! I’ve wasted countless hours not being aware of these.

## Preview Boxes

First, LyX has a beautiful feature of “preview boxes”. Take the following very simple TikZ code, which just draws a square:

 \begin{tikzpicture} \draw[red] (0,0) -- (0,1); \draw[green] (0,1) -- (1,1); \draw[red] (1,1) -- (1,0); \draw[blue] (1,0) -- (0,0); \end{tikzpicture} 

Typically, I’d include this in LyX files by inserting a raw tex box:

And then putting the the TikZ code inside:

This is OK, but has the disadvantages from (2) above. The code can be huge, if I have a lot of graphics, I can’t tell what corresponds to what, and I have to do a (slooooooow) recompile of the whole document to see what it looks like.

However, if you just add a “preview box”:

You get something that looks like this:

So far, so pointless, right? However, when you deselect, LyX shows the graphic in-place:

You can then click on it to expand the code. This solves most of the problems: You can see what you are doing at a glance, and you don’t need to recompile the whole document to do it.

## Native SVG support

Newer versions of LyX also natively support SVG files. You first have to create the file externally using something like Inkscape, which itself saves directly to the SVG format. Then, you can include it in LyX by doing Insert->File->External Material:

And then selecting the SVG file:

Again, LyX will show it in-place and (if LyX is configured correctly…) correctly output vector graphics in the final document.

What’s even better is that LyX can automatically open the file in the external editor for you. If you right click, you can “edit externally”:

Then the external editor will automatically open the file. You can then save it with a keystroke and go back to LyX. No hunting around for the file, no cycles of exporting to other formats, and you see exactly what the final output will look like at all stages. You can really tell that LyX was created by people using it themselves.

## Bonus feature

This one is described well-enough already, but helps a lot in big documents: you can click on a point in a generated .pdf and automatically have LyX sync the editor to the corresponding point in the file.

# Reducing Sigmoid computations by (at least) 88.0797077977882%

A classic implementation issue in machine learning is reducing the cost of computing the sigmoid function

$\sigma(a) = \frac{1}{1+\exp(-a)}$.

Specifically, it is common to profile your code and discover that 90% of the time is spent computing the $\exp$ in that function.  This comes up often in neural networks, as well as in various probabilistic architectures, such as sampling from Ising models or Boltzmann machines.  There are quite a few classic approximations to the function, using simple polynomials, etc. that can be used in neural networks.

Today, however, I was faced with a sampling problem involving the repeated use of the sigmoid function, and I noticed a simple trick that could reduce the number of sigmoids by about 88% without introducing any approximation.  The particular details of the situation aren’t interesting, but I repeatedly needed to do something like the following:

1. Input $a \in \Re$
2. Compute a random number $r \in [0,1]$
3. If $r < \sigma(a)$
4.   Output $+1$
5. Else
6.     Output $-1$

Now, let’s assume for simplicity that $a$ is positive.  (Otherwise, sample using $-a$ and then switch the sign of the output.)  There are two observations to make:

1. If $a$ is large, then you are likely to output $+1$
2. Otherwise, there are easy upper and lower bounds on the probability of outputting $+1$

This leads to the following algorithm:

1. Input $a \in \Re$
2. Compute a random number $r \in [0,1]$
3. If $a \geq 2$
4.     If $r \leq 0.880797077977882$ or $r \leq \sigma(a)$
5.         Output $+1$
6.     Else
7.         Output $-1$
8. Else
9.     If $r > .5 + a/4$
10.         Output $-1$
11.     Else if $r \leq .5 + a/5.252141128658$ or $r \leq \sigma(a)$
12.         Output $+1$
13.     Else
14.         Output $-1$

The idea is as follows:

1. If $a\geq 2$, then we can lower-bound the probability of outputting +1 by a pre-computed value of $\sigma(2)\approx0.8807...$, and short-circuit the computation in many cases.
2. If $a\leq 2$, then we can upper bound the sigmoid function by $.5+a/4$.
3. If $a\leq 2$, then we can also lower bound by $.5+a/5.252141...$ respectively.  (This constant was found numerically).

The three cases are illustrated in the following figure, where the input $a$ is on the x-axis, and the random number $r$ is on the y-axis.

Since, for all $a$ at least a fraction $\sigma(2)\approx.8807$ of the numbers will be short-circuited, sigmoid calls will be reduced appropriately.  If $a$ is often near zero, you will do even better.

Obviously, you can take this farther by adding more cases, which may or may not be helpful, depending on the architecture, and the cost of branching vs. the cost of computing an $\exp$.

# Quotients

It seems to me that thinking of quotients as a fundamental operator is usually painful and unnecessary when the objects are almost anything other than real (or rational) numbers. Instead it is better to think of a quotient as a combination of the reciprocal and the product. A good example of this is complex numbers. Suppose that

$z=a+bi$
$w=c+di.$

Then, the usual rule for the quotient is that

$\displaystyle{z/w = \frac{ac+bd}{c^2+d^2} + i\frac{bc-ad}{c^2+d^2}}.$

This qualifies as non-memorizable. On the other hand, take the reciprocal of $w$

$\displaystyle{1/w = \frac{c-di}{c^2+d^2}}$.

This is simple enough (“the complex conjugate divided by the squared norm”), and we recover the rule for the quotient easily enough by multiplying with $z$.

The same thing holds true for derivatives. I’ve never been able to remember that quotient rule from high-school; Namely that if $f(x)=g(x)/h(h)$, then

$\displaystyle{f'(x) = \frac{h(x)g'(x)-h'(x)g(x)}{h(x)^2}}$

Ick. Instead, better to note that if $r(x) = 1/h(h)$ then

$\displaystyle{r'(x) = \frac{-h'(x)}{h(x)^2}},$

along with the standard rule for differentiating products, so that if $f(x)=g(x)/h(x)=g(x)r(x)$, then

$\displaystyle{f'(x) = g(x)r'(x)+g'(x)r(x)}$.

Another case would be the “matrix quotient” $B C^{-1}$. Of course, everyone already thinks of the matrix multiplication and inverse as separate operations– to do otherwise would be horrible– but I think that just proves the point…

(Although, I assume that computing $BC^{-1}$ as a single operation would be more numerically stable than first taking an explicit inverse. This might mean something to people who feel that mathematical notation ought to suggest an obvious stable implementation in IEEE floating point (if any).)

# Matrix Calculus

Based on a lot of requests from students, I did a lecture on matrix calculus in my machine learning class today. This was based on Minka’s Old and New Matrix Algebra Useful for Statistics and Magnus and Neudecker’s Matrix Differential Calculus with Applications in Statistics and Econometrics.

In making the notes, I used a couple innovations, which I am still debating the wisdom of. The first is the rule for calculating derivatives of scalar-valued functions of a matrix input $f(X)$. Traditionally, this is written like so:

if $dy = \text{tr}(A^T dX)$ then $\frac{dy}{dX} = A$.

I initially found the presence of the trace here baffling. However, there is the simple rule that

$\text{tr}(A^T B) = A \cdot B$

where $\cdot$ is the matrix inner product. This puts the rule in the much more intuitive (to me!) form:

if $dy = A \cdot dX$ then $\frac{dy}{dX} = A$.

This seems more straightforward, but it comes at a cost. When working with the rule in the trace form, one often needs to do quite a bit of shuffling around of matrices. This is easy to do using the standard trace identities like

$\text{tr}(ABC)=\text{tr}(CAB)$.

If we are to work with inner-products, we will require a similar set of rules. It isn’t too hard to show that there are “dual” identities like

$A \cdot (BC) = B \cdot (AC^T) = C \cdot (B^T A)$

which allow a similar shuffling with dot products. Still, these are certainly less easy to remember.

There are also a set of other rules that seem to be needed in practice, but aren’t included in standard texts. For example, if $R$ is a function that is applied elementwise to a matrix or vector (e.g. $\sin$), then

$d(R(F)) = R'(F(X)) \odot dF$

where $\odot$ is the elementwise product. This then requires other (very simple) identities for getting rid of the elementwise product, such as

${\bf x}\odot{\bf y} = \text{diag}({\bf x}) {\bf y} = \text{diag}({\bf y}) {\bf x}$.

Another issue with using dot products everywhere is the need to constantly convert between transposes and inner-products. (This issue comes up because I prefer a “all vectors are column vectors” convention) The never-ending debate of if we should write

${\bf x} \cdot {\bf y}$

or

${\bf x}^T {\bf y}$

seems to have particular importance here, and I’m not sure of the best choice.

# Automatic Differentiation Without Compromises

Automatic differentiation is a classic numerical method that takes a program, and (with minimal programmer effort) computes the derivatives of that program. This is very useful because, when optimizing complex functions, a lot of time tends to get spent manually deriving and then writing code for derivatives. Some systems like cvx do a great job of recognizing simple problem types (linear programs, cone programs, etc.), but can’t handle arbitrary functions.

At present, automatic differentiation involves a degree of pain. Typically, the user needs to write C or C++ or Fortran code and augment the code with “taping” commands to get it to work. There is also usually a significant computational overhead.

In a just world, there would be no pain, and there would be no computational overhead. In a just world, reverse-mod autodiff would work like this:

STEP 1 The user writes the function they want to optimize in a convenient high level language like Python.

This is easy, but slow, and provides no derivatives.

Consider a function that sums the elements in a matrix product. The user (apparently unaware that numpy provides matrix multiplication) writes:

    def fun(A,B):
rez = 0
for i in range(A.shape[0]):
for j in range(B.shape[1]):
rez += dot(A[i,:],B[:,j])
return rez


STEP 2 The user feeds their high level function into a library. This library uses operator overloading magic to build an expression graph representation of the function. This expression graph is then used to generate efficient machine ode for both the function, and its derivatives.

The user merely should have to do something like

    A = numpy.random.rand(3,5)
B = numpy.random.rand(5,4)
dfun = compile_fun(fun,A,B)


They could then compute the function and derivatives (all in compiled code) using

    F,[dFdA,dFdB] = dfun(A,B)


Thus, the user gets everything, with out having to compromise: maximum performance, and maximum convenience.

Since this is how things should work, I spent some time this past summer writing a library that does. The above code is actually a working example, calling the compile_fun method– the only method a consumer of the library needs to know about.  This library comes tantalizingly close to my goals.

Another simple example, mixing matrices and scalars:

def fun(A,B,a):
return sum(sum((A*a-B)**2))
a = 2.
A = random.rand(15,22)
B = random.rand(15,22)
import etree
dfun = etree.compile_fun(fun,A,B,a)
F,[dFdA,dFdB,dFda] = dfun(A,B,a)


Consider the above matrix multiplication example, with 60×60 matrices. The library can generate the expression graph, transform that into a simple bytecode, than transform that bytecode into C code acceptably quickly– 11.97 seconds on my machine.  The code then runs very fast:  .0086 seconds as opposed to .0254 seconds for just running the original function or .0059 seconds for calling numpy’s matrix multiplication routine dot(A,B). (Which, of course, does not provide derivatives!)

There is, inevitably, one large problem:  horrific compilation times on large functions.  To take the C code and transform it into machine code, gcc takes 1116 seconds.  Why, you ask?  The reason is because gcc is trying to compile a single 36.5 MB function:

#include "math.h"
void rock(double* A, double* D){
A[12800]=A[0] * A[6400];
A[12801]=A[1] * A[6480];
A[12802]=A[12800] + A[12801];
A[12803]=A[2] * A[6560];
A[12804]=A[12802] + A[12803];
A[12805]=A[3] * A[6640];
// thousands of pages of same
D[12801] = D[12802];
D[1] += D[12801] * A[6480];
D[6480] += D[12801] * A[1];
D[0] += D[12800] * A[6400];
D[6400] += D[12800] * A[0];
}


Though this is very simple code, it seems that gcc still uses some super-linear algorithms, so it can’t handle large functions.

(To be clear, for small or medium sized problems, the code compiles very quickly.)

Now, clearly, I am doing something wrong. Frankly I am writing this largely in the hope that someone who actually knows something about compilers and programming languages can enlighten me.

1) Use a better compiler.  I’ve tried this.  I can turn off gcc’s optimization, which decreases compilation times, though to a still very slow level.  Alternatively, I could use a fast compiler.  tcc flies right through the code.  Unfortunately, the machine code that tcc generates is very slow– I think due to poor register allocation.

2) Don’t generate intermediate C code– directly generate machine code or assembly.  This would certainly work, but there is no way I am up to it. Maybe it would make sense to target the JVM?

3) Use one of the newfangled just in time virtual machine things, like LLVM or libJIT.  This seems promising, but after quite a bit of research I’m still really unsure about how well this is likely to work, or how to proceed.

4) Write a fast interpreter, and just pass the bytecode to this.  I’ve tried this as well.  This is basically the strategy used by some autodiff packages, e.g. cppad. This can get reasonable performance, but will never compete with native code. The reason is– I think– that the interpreter is basically a giant switch statement, and all the branching screws up the pipeline on the CPU.

Anyway, the code is available as etree.py, and a tiny file mycode.h. (Let’s say it’s available under the MIT license, and the condition that you not laugh too much at my programming skills.) To try it out, just start python and do

import etree
etree.runtests()


At this point, the library is useful (I use it!) but only for relatively small problems, containing no more than, say, a few thousand variables.

If anyone has ideas for better compiler flags, or knows about the feasibility of targeting LLVM or libJIT instead of C code, please get in touch.

Disclaimer: There are a bunch of minor issues with the library. These are all fixable, but it isn’t really worth the effort unless the compilation times can be dealt with.

• The function cannot branch. (No if statements).
• If you want to multiply a matrix by a scalar, the scalar must come after the matrix.
• I know it works on 32 bit ubuntu with everything installed via synaptic package manager, and on 64 bit OS X, simply using Sage (with this patch to make scipy/weave work). If you want to play around with this, but aren’t a python programmer, I would suggest using Sage.

# What Gauss-Seidel is Really Doing

I’ve been reading Alan Sokal’s lecture notes “Monte Carlo Methods in Statistical Mechanics: Foundations and New Algorithms” today. Once I learned to take the word “Hamiltonian” and mentally substitute “function to be minimized”, they are very clearly written.

Anyway, the notes give an explanation of the Gauss-Seidel iterative method for solving linear systems that is so clear, I feel a little cheated that it was never explained to me before. Let’s start with the typical (confusing!) explanation of Gauss-Seidel (as in, say, Wikipedia). You want to solve some linear system

$A{\bf x}={\bf b}$,

for $\bf x$. What you do is decompose $A$ into a diagonal matrix $D$, a strictly lower triangular matrix $L$, and a strictly upper triangular matrix $U$, like

$A=L+D+U$.

To solve this, you iterate like so:

${\bf x} \leftarrow (D+L)^{-1}({\bf b}-U{\bf x})$.

This works when $A$ is symmetric positive definite.

Now, what the hell is going on here? The first observation is that instead of inverting $A$, you can think of minimizing the function

$\frac{1}{2} {\bf x}^T A {\bf x} - {\bf x}^T{\bf b}$.

(It is easy to see that the global minimum of this is found when $A{\bf x}={\bf b}$.) Now, how to minimize this function? One natural way to approach things would be to just iterate through the coordinates of ${\bf x}$, minimizing the function over each one. Say we want to minimize with respect to $x_i$. So, we are going to make the change $x_i \leftarrow x_i + d$. Thus, we want to minimize (with respect to d), this thing:

$\frac{1}{2} ({\bf x}+d {\bf e_i})^T A ({\bf x}+d {\bf e_i}) - ({\bf x}+d {\bf e_i})^T {\bf b}$

Expanding this, taking the derivative w.r.t. $d$, and solving gives us (where ${\bf e_i}$ is the unit vector in the $i$th direction)

$d = \frac{{\bf e_i}^T({\bf b}-A{\bf x})}{{\bf e_i}^TA{\bf e_i}}$,

or, equivalently,

$d = \frac{1}{A_{ii}}(b_i-A_{i*}{\bf x})$.

So, taking the solution at one timestep, ${\bf x}^t$, and solving for the solution ${\bf x}^{t+1}$ at the next timestep, we will have, for the first index,

$x^{t+1}_1 = x^t_1 + \frac{1}{A_{11}}(b_1-A_{1*}x^t)$.

Meanwhile, for the second index,

$x^{t+1}_2 = x^t_2 + \frac{1}{A_{22}}(b_2-A_{2*}(x^t-{\bf e_1} x^t_1 +{\bf e_1} x^{t+1}_1 ))$.

And, more generally

$x^{t+1}_n = x^t_n + \frac{1}{A_{nn}}(b_n-A_{n*}(\sum_{i.

Now, all is a matter of cranking through the equations.

$x^{t+1}_n = x^t_n + \frac{1}{A_{nn}}(b_n-\sum_{i

$D {\bf x}^{t+1} = D {\bf x}^t + {\bf b}- L x^{t+1} - (D+U) {\bf x}^t$

$(D+L) {\bf x}^{t+1} = {\bf b}-U {\bf x}^t$

Which is exactly the iteration we started with.

The fact that this would work when $A$ is symmetric positive definite is now obvious– that is when the objective function we started with is bounded below.

The understanding that Gauss-Seidel is just a convenient way to implement coordinate descent also helps to get intuition for the numerical properties of Gauss-Seidel (e.g. quickly removing "high frequency error", and very slowly removing "low frequency error").

# A simple explanation of reverse-mode automatic differentiation

My previous rant about automatic differentiation generated several requests for an explanation of how it works. This can be confusing because there are different types of automatic differentiation (forward-mode, reverse-mode, hybrids.) This is my attempt to explain the basic idea of reverse-mode autodiff as simply as possible.

Reverse-mode automatic differentiation is most attractive when you have a function that takes $n$ inputs $x_1,x_2,...,x_n$, and produces a single output $x_N$. We want the derivatives of that function, $\displaystyle{\frac{d x_N}{d x_i}}$, for all $i$.

Point #1: Any differentiable algorithm can be translated into a sequence of assignments of basic operations.

Forward-Prop

for $i=n+1,n+2,...,N$

$x_i \leftarrow f_i({\bf x}_{\pi(i)})$

Here, each function $f_i$ is some very basic operation (e.g. addition, multiplication, a logarithm) and $\pi(i)$ denotes the set of “parents” of $x_i$. So, for example, if $\pi(7)=(2,5)$ and $f_7 = \text{add}$, then $x_7 = x_2 + x_5$.

It would be extremely tedious, of course, to actually write an algorithm in this “expression graph” form. So, autodiff tools create this representation automatically from high-level source code.

Point #2: Given an algorithm in the previous format, it is easy to compute its derivatives.

The essential point here is just the application of the chain rule.

$\displaystyle{ \frac{d x_N}{d x_i} = \sum_{j:i\in \pi(j)} \frac{d x_N}{d x_j}\frac{\partial x_j}{\partial x_i}}$

Applying this, we can compute all the derivatives in reverse order.

Back-Prop

$\displaystyle{ \frac{d x_N}{d x_N} \leftarrow 1}$

for $i=N-1,N-2,...,1$

$\displaystyle{ \frac{d x_N}{d x_i} \leftarrow \sum_{j:i\in \pi(j)} \frac{d x_N}{d x_j}\frac{\partial f_j}{\partial x_i}}$

That’s it!  Just create an expression graph representation of the algorithm and differentiate each basic operation $f_i$ in reverse order using calc 101 rules.

Other stuff:

• No, this is not the same thing as symbolic differentiation.  This should be obvious:  Most algorithms don’t even have simple symbolic representations.  And, even if yours does,  it is possible that it “explodes” upon symbolic differentiation.  As a contrived example, try computing the derivative of $\exp(\exp(\exp(\exp(\exp(\exp(x))))))$.
• The complexity of the back-prop step is the same as the forward propagation step.
• In machine learning, functions from N inputs to one output come up all the time:  The N inputs are parameters defining a model, and the 1 output is a loss, measuring how well the model fits training data.  The gradient can be fed into an optimization routine to fit the model to data.
• There are two common ways of implementing this:
1. Operator Overloading.  One can create a new variable type that has all the common operations of numeric types, which automatically creates an expression graph when the program is run.   One can then call the back-prop routine on this expression graph.  Hence,  one does not need to modify the program, just replace each numeric type with this new type.  This is fairly easy to implement, and very easy to use.  David Gay‘s RAD toolbox for C++ is a good example, which I use all the time.
The major downside of operator overloading is efficiency:  current compilers will not optimize the backprop code.  Essentially, this  step is interpreted.  Thus, one finds in practice a non-negligible overhead of, say, 2-15 times the complexity of the original algorithm using a native numeric type.  (The overhead depends on how much the original code benefits from compiler optimizations.)
2. Source code transformation. Alternatively, one could write a program that examines the source code of the original program, and transforms this into source code computing the derivatives.  This is much harder to implement, unless one is using a language like Lisp with very uniform syntax.  However, because the backprop source code produced is then optimized like normal code, it offers the potential of zero overhead compared with manually computed derivatives.
• If it isn’t convenient to use automatic differentiation, one can also use “manual automatic differentation”.  That is, to compute the derivatives, just attack each intermediate value your algorithm computes, in reverse order.
• Some of the most interesting work on autodiff comes from Pearlmutter and Siskind, who have produced a system called Stalingrad for a subset of scheme that allows for crazy things like taking derivatives of code that itself is taking derivates.  (So you can, for example, produce Hessians.)  I think they wouldn’t mind hearing from potential users.

# Hessian-Vector products

You have some function $f({\bf x})$.  You have figured out how to compute it’s gradient, $g({\bf x})=\frac{\partial}{\partial \bf x}f({\bf x})$.  Now, however, you find that you are implementing some algorithm (like, say, Stochastic Meta Descent), and you need to compute the product of the Hessian $H({\bf x})=\frac{\partial^2}{\partial {\bf x}\partial{\bf x}^T}f({\bf x})$ with certain vectors.  You become very upset, because either A) you don’t feel like deriving the Hessian (probable), or B) the Hessian has $N^2$ elements, where $N$ is the length of $\bf x$ and that is too big to deal with (more probable).  What to do?  Behold:

${\bf g}({\bf x}+{\bf \Delta x}) \approx {\bf g}({\bf x}) + H({\bf x}){\bf \Delta x}$

Consider the Hessian-Vector product we want to compute, $H({\bf x}){\bf v}$.  For small $r$,

${\bf g}({\bf x}+r{\bf v}) \approx {\bf g}({\bf x}) + r H({\bf x}){\bf v}$

And so,

$\boxed{H({\bf x}){\bf v}\approx\frac{{\bf g}({\bf x}+r{\bf v}) - {\bf g}({\bf x})}{r}}$

This trick above has been around apparently forever.  The approximation becomes exact in the limit $r \rightarrow 0$.  Of course, for small $r$, numerical problems will also start to kill you.  Pearlmutter’s algorithm is a way to compute $H {\bf v}$ with the same complexity, with out suffering rounding errors.  Unfortunately, Pearlmutter’s algorithm is kind of complex, while the above is absolutely trivial.

Update: Perlmutter himself comments below that if we want to use the finite difference trick, we would do better to use the approximation:

$\boxed{H({\bf x}){\bf v}\approx\frac{{\bf g}({\bf x}+r{\bf v}) - {\bf g}({\bf x}-r{\bf v})}{2r}}.$

This expression will be closer to the true value for larger $r$, meaning that we are less likely to get hurt by rounding error. This is nicely illustrated on page 6 of these notes.

# Why does regularization work?

When fitting statistical models, we usually need to “regularize” the model. The simplest example is probably linear regression. Take some training data, ${({\hat {\bf x}}, {\hat y})}$. Given a vector of weights $\bf w$, the total squared distance is

$\sum_{ ({\hat {\bf x}}, {\hat y})} ({\bf w}^T{\hat {\bf x}} - {\hat y})^2$

So to fit the model, we might find ${\bf w}$ to minimize the above loss. Commonly, (particularly when ${\bf x}$ has many dimensions), we find t\hat the above procedure leads to overfitting: very large weights ${\bf w}$ t\hat fit the training data very well, but poorly predict future data. “Regularization” means modifying the optimization problem to prefer small weights. Consider

$\sum_{ ({\hat {\bf x}}, {\hat y})} ({\bf w}^T{\hat {\bf x}} - {\hat y})^2 + \lambda ||{\bf w}||^2.$

Here, $\lambda$ trades off how much we care about the model fitting well versus how we care about ${\bf w}$ being small. Of course, this is a much more general concept that linear regression, and one could pick many alternatives to the squared norm to measure how simple a model is. Let’s abstract a bit, and just write some function $M$ to represent how well the model fits the data, and some function $R$ to represent how simple it is.

$M({\bf w}) + \lambda R({\bf w})$

The obvious question is: how do we pick $\lambda$? This, of course, has been the subject of much theoretical study. What is commonly done in practice is this:

1. Divide the data into a training set and test set.
2. Fit the model for a range of different $\lambda$s using only the training set.
3. For each of the models fit in step 2, check how well the resulting weights fit the test data.
4. Output the weights that perform best on test data.

(One can also retrain on all the data using the $\lambda$ that did best in step 2.)

Now, there are many ways to measure simplicity. (In the example above, we might consider $||{\bf w}||^2$, $||{\bf w}||$, $||{\bf w}||_1$, $||{\bf w}||_0$, …). Which one to use? If you drank too much coffee one afternoon, you might decide to include a bunch of regulizers simultaneously, each with a corresponding regularization parameter, ending up with a problem like

$M({\bf w}) + \sum_i \lambda_i R_i({\bf w})$.

And yet, staring at that equation, things take a sinister hue: if we include too many regularization parameters, won’t we eventually overfit the regularization parameters? And furthermore, won’t it be very hard to test all possible combinations of $\lambda_i$ to some reasonable resolution? What should we do?

And now I will finally get to the point.

Recently, I’ve been looking at regularization in a different way, which seems very obvious in retrospect. The idea is that when searching for the optimal regularization parameters, we are fitting a model– a model that just happens to be defined in an odd way.

First, let me define some notation. Let $T$ denote the training set, and let $S$ denote the test set. Now, define

${\bf f}(\lambda) = \arg\min_{\bf w} M_T({\bf w}) + \lambda R({\bf w})$

Where $M_T$ is the function measuring how well the model fits the training data. So, for a given regularization parameter, ${\bf f}$ returns the weights solving the optimization problem. Given that, the optimal \lambda will the the one minimizing this equation:

$\lambda^* = \arg\min_\lambda M_S( {\bf f}(\lambda) )$

This extends naturally to the setting where we have a vector of regularization parameters ${\bf \lambda}$.

${\bf f}({\boldsymbol \lambda}) = \arg\min_{\bf w} M_T({\bf w}) + \sum_i \lambda_i R_i({\bf w})$

${\boldsymbol \lambda}^* = \arg\min_{\bf \lambda} M_S( {\bf f}({\boldsymbol \lambda}) )$

From this perspective, doing an optimization over regularization parameters makes sense: it is just fitting the parameters $\boldsymbol \lambda$ to the data $S$— it just so happens that the objective function is implicitly defined by a minimization over the data $T$.

Regularization works because it is fitting a single parameter to some data. (In general, one is safe fitting a single parameter to any reasonably sized dataset although there are exceptions.) Fitting multiple regularization parameters should work, supposing the test data is big enough to support them. (There are computational issues I’m not discussing here with doing this, stemming from the fact that ${\bf f}({\boldsymbol \lambda})$ isn’t known in closed form, but is implicitly defined by a minimization. So far as I know, this is an open problem, although I suspect the solution is “implicit differentiation”)

Even regularizing the regularization parameters isn’t necessarily a crazy idea, though clearly $||{\boldsymbol \lambda}||^2$ isn’t the way to do it. (Simpler models correspond to larger regularization constants. Maybe $1/||{\boldsymbol \lambda}||^2$ ?.) Then, you can regularize those parameters!

Well. In retrospect that wasn’t quite so simple as it seemed.

It’s a strong bet that this is a standard interpretation in some communities.

Update: The idea of fitting multiple parameters using implicit differentiation is published in the 1996 paper Adaptive regularization in neural network modeling by Jan Larsen, Claus Svarer, Lars Nonboe Andersen and Lars Kai Hansen.

# Unbiased coinflips from biased coinflips

A old problem, due to von Neumann goes as follows:

You have a biased coin that produces heads (H) with probability $p$, and tails (T) with probability $(1-p).$You don’t know $p$. How can you use this coin to simulate an unbiased coin?

The next paragraph contains a solution, so if you want to solve the problem yourself, stop reading now!

von Neumann’s solution was as follows*:

1. Flip the coin twice
2. If the outcome is HT, output H
3. If the outcome is TH, output T
4. Otherwise, go to 1.

A nice solution, but you can see that you might need to flip the coin many times. In particular, the probability of getting either HH or TT in any particular round is $p^2 + (1-p)^2$, which could be really big for highly biased coins.

Is there a more efficient method?

Today I found a beautiful paper examining this question. The insight is that the von Neumann scheme is based on symmetry– picking pairs of output strings that have equal probability, then outputting heads for one and tails for the other.

You can draw an (infinitely large) tree, with branches corresponding to random coin flip outcomes, and leaf nodes for an output. The expected number of coin flips for an output is the expected depth in the tree that one reaches before outputting a result and quitting. Viewed in this light, one can think of ways to create more efficient algorithms. The paper contains trees illustrating this wonderfully.

* Incidentally, when starting a paragraph with the name of a person whose first name is not capitalized, should the first character of the paragraph be capitalized? Which convention takes precedence?