When fitting statistical models, we usually need to “regularize” the model. The simplest example is probably linear regression. Take some training data, $D = \{({\bf x}_i, y_i)\}$. Given a vector of weights ${\bf w}$, the total squared distance is

$$\sum_i ({\bf w}^T {\bf x}_i - y_i)^2.$$

So to fit the model, we might find ${\bf w}$ to minimize the above loss. Commonly (particularly when ${\bf x}$ has many dimensions), we find that the above procedure leads to **overfitting**: very large weights ${\bf w}$ that fit the training data very well, but poorly predict future data. “Regularization” means modifying the optimization problem to prefer small weights. Consider

$$\sum_i ({\bf w}^T {\bf x}_i - y_i)^2 + \lambda \|{\bf w}\|^2.$$

Here, $\lambda$ trades off how much we care about the model fitting well versus how much we care about ${\bf w}$ being small. Of course, this is a much more general concept than linear regression, and one could pick many alternatives to the squared norm to measure how simple a model is. Let’s abstract a bit, and just write some function $M({\bf w})$ to represent how well the model fits the data, and some function $R({\bf w})$ to represent how simple it is.

$$\min_{\bf w} \; M({\bf w}) + \lambda R({\bf w})$$
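For the linear regression example, the regularized problem happens to have a closed-form solution, which makes the effect of $\lambda$ easy to see. The following is a minimal sketch, assuming NumPy and synthetic data; the helper name `fit_ridge` and the data-generating process are made up for illustration.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize sum_i (w^T x_i - y_i)^2 + lam * ||w||^2 in closed form."""
    d = X.shape[1]
    # Setting the gradient to zero gives (X^T X + lam * I) w = X^T y.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic data (an assumption for illustration): 20 points in 10 dimensions,
# where only the first input actually matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))
y = X[:, 0] + 0.5 * rng.normal(size=20)

# The norm of the fitted weights shrinks as lam grows.
norms = [float(np.linalg.norm(fit_ridge(X, y, lam))) for lam in (0.0, 1.0, 10.0, 100.0)]
print(norms)
```

Larger values of `lam` pull the weights toward zero, at the cost of a worse fit to the training data.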

The obvious question is: how do we pick $\lambda$? This, of course, has been the subject of much theoretical study. What is commonly done **in practice** is this:

1. Divide the data into a training set and test set.
2. Fit the model for a range of different $\lambda$s using only the training set.
3. For each of the models fit in step 2, check how well the resulting weights fit the test data.
4. Output the weights that perform best on test data.

(One can also retrain on all the data using the $\lambda$ that did best in step 3.)
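The four steps above can be sketched in a few lines. This is only an illustration, assuming NumPy, a squared-norm regularizer, and synthetic data; `fit_ridge`, `squared_error`, and `select_lambda` are hypothetical helper names.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Weights minimizing squared error plus lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def squared_error(X, y, w):
    return float(np.sum((X @ w - y) ** 2))

def select_lambda(X_train, y_train, X_test, y_test, lambdas):
    """Return (lam, test_error, weights) for the best-performing lam."""
    best = None
    for lam in lambdas:
        w = fit_ridge(X_train, y_train, lam)        # step 2: fit on training data
        err = squared_error(X_test, y_test, w)      # step 3: score on test data
        if best is None or err < best[1]:
            best = (lam, err, w)
    return best                                     # step 4: best weights

# Synthetic data (an assumption): 30 inputs, only 3 of which matter.
rng = np.random.default_rng(0)
w_true = np.zeros(30)
w_true[:3] = 1.0
X = rng.normal(size=(60, 30))
y = X @ w_true + rng.normal(size=60)

# Step 1: split into training and test sets.
X_train, y_train = X[:40], y[:40]
X_test, y_test = X[40:], y[40:]

lambdas = [10.0 ** k for k in range(-3, 4)]
best_lam, best_err, best_w = select_lambda(X_train, y_train, X_test, y_test, lambdas)
print(best_lam, best_err)
```

The grid of candidate values is logarithmic here because the useful range of $\lambda$ often spans several orders of magnitude; that choice is a common habit, not something the procedure requires.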

Now, there are many ways to measure simplicity. (In the example above, we might consider $\|{\bf w}\|^2$, $\|{\bf w}\|$, $\|{\bf w}\|_1$, $\|{\bf w}\|_\infty$, …). Which one to use? If you drank too much coffee one afternoon, you might decide to include a bunch of regularizers simultaneously, each with a corresponding regularization parameter, ending up with a problem like

$$\min_{\bf w} \; M({\bf w}) + \sum_i \lambda_i R_i({\bf w}).$$

And yet, staring at that equation, things take a sinister hue: if we include too many regularization parameters, won’t we eventually overfit the regularization parameters? And furthermore, won’t it be very hard to test all possible combinations of $\lambda_i$ to some reasonable resolution? What should we do?

**And now I will finally get to the point.**

Recently, I’ve been looking at regularization in a different way, which seems very obvious in retrospect. The idea is that when searching for the optimal regularization parameters, we are fitting a model: a model that just happens to be defined in an odd way.

First, let me define some notation. Let $T$ denote the training set, and let $V$ denote the test set. Now, define

$${\bf f}(\lambda) = \arg\min_{\bf w} \; M_T({\bf w}) + \lambda R({\bf w}),$$

where $M_T$ is the function measuring how well the model fits the training data. So, for a given regularization parameter, ${\bf f}$ returns the weights solving the optimization problem. Given that, the optimal $\lambda$ will be the one minimizing this equation:

$$\lambda^* = \arg\min_\lambda \; M_V({\bf f}(\lambda)).$$

This extends naturally to the setting where we have a vector of regularization parameters $\boldsymbol\lambda$.
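As a rough sketch of the vector case, assume two regularizers that each penalize the squared norm of a different block of weights. That choice is made purely for illustration, because it keeps the inner minimization in closed form; the helper names and the synthetic data are assumptions too. The outer problem is then an ordinary search over pairs of regularization parameters, scored on the test set.

```python
import itertools
import numpy as np

def fit_two_penalties(X, y, lam_a, lam_b, n_a):
    """Minimize ||Xw - y||^2 + lam_a * ||w[:n_a]||^2 + lam_b * ||w[n_a:]||^2."""
    d = X.shape[1]
    penalty = np.concatenate([np.full(n_a, lam_a), np.full(d - n_a, lam_b)])
    # Zero gradient gives (X^T X + diag(penalty)) w = X^T y.
    return np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ y)

def heldout_loss(X, y, w):
    return float(np.sum((X @ w - y) ** 2))

# Synthetic data (an assumption): the first 5 weights matter, the other 15 do not.
rng = np.random.default_rng(1)
n_a = 5
w_true = np.concatenate([np.ones(n_a), np.zeros(15)])
X = rng.normal(size=(80, 20))
y = X @ w_true + rng.normal(size=80)
X_T, y_T = X[:50], y[:50]   # training set T
X_V, y_V = X[50:], y[50:]   # test set V

# Outer objective evaluated on a grid of (lam_a, lam_b) pairs.
grid = [10.0 ** k for k in range(-2, 3)]
results = {
    (la, lb): heldout_loss(X_V, y_V, fit_two_penalties(X_T, y_T, la, lb, n_a))
    for la, lb in itertools.product(grid, grid)
}
best_pair = min(results, key=results.get)
print(best_pair, results[best_pair])
```

With `k` regularization parameters and `len(grid)` candidate values each, a full grid needs `len(grid) ** k` inner fits, which is why exhaustive search stops being practical quickly.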

From this perspective, doing an optimization over regularization parameters makes sense: **it is just fitting the parameters $\boldsymbol\lambda$ to the test data $V$; it just so happens that the objective function is implicitly defined by a minimization over the training data $T$.**

Regularization works because it is fitting a single parameter to some data. (In general, one is safe fitting a single parameter to any reasonably sized dataset, although there are exceptions.) Fitting multiple regularization parameters should work, supposing the test data is big enough to support them. (There are computational issues I’m not discussing here with doing this, stemming from the fact that the fitted weights ${\bf f}(\boldsymbol\lambda)$ aren’t known in closed form, but are implicitly defined by a minimization. So far as I know, this is an open problem, although I suspect the solution is “implicit differentiation”.)
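To make the implicit-differentiation idea concrete, here is a small sketch for the one case where everything can be checked by hand: squared error with a squared-norm regularizer and a single $\lambda$. The minimizer satisfies $(X^T X + \lambda I){\bf w} = X^T {\bf y}$, and differentiating that condition with respect to $\lambda$ gives $d{\bf w}/d\lambda = -(X^T X + \lambda I)^{-1} {\bf w}$. The chain rule then yields the derivative of the test loss. The function names and synthetic data below are assumptions for illustration, and this is not claimed to be the method of any particular paper.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Weights minimizing ||Xw - y||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def heldout_loss(lam, X_tr, y_tr, X_te, y_te):
    """Test-set squared error of the weights fit with this lam."""
    w = fit_ridge(X_tr, y_tr, lam)
    return float(np.sum((X_te @ w - y_te) ** 2))

def heldout_loss_grad(lam, X_tr, y_tr, X_te, y_te):
    """Derivative of heldout_loss with respect to lam via implicit differentiation."""
    A = X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1])
    w = np.linalg.solve(A, X_tr.T @ y_tr)
    # Differentiating the optimality condition A w = X^T y in lam gives A dw + w = 0.
    dw_dlam = -np.linalg.solve(A, w)
    # Gradient of the test loss with respect to the weights.
    grad_w = 2.0 * X_te.T @ (X_te @ w - y_te)
    return float(grad_w @ dw_dlam)

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 8))
y = X @ rng.normal(size=8) + rng.normal(size=60)
X_tr, y_tr, X_te, y_te = X[:40], y[:40], X[40:], y[40:]

lam, eps = 1.0, 1e-5
analytic = heldout_loss_grad(lam, X_tr, y_tr, X_te, y_te)
# Central finite difference as an independent check.
numeric = (heldout_loss(lam + eps, X_tr, y_tr, X_te, y_te)
           - heldout_loss(lam - eps, X_tr, y_tr, X_te, y_te)) / (2 * eps)
print(analytic, numeric)
```

With a derivative available, $\lambda$ can be tuned by gradient-based optimization rather than by a grid. The same pattern applies when the inner minimizer has no closed form, provided the optimality condition can be differentiated.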

Even regularizing the regularization parameters isn’t necessarily a crazy idea, though clearly $\|\boldsymbol\lambda\|^2$ isn’t the way to do it. (Simpler models correspond to **larger** regularization constants. Maybe $\sum_i 1/\lambda_i$?) Then, you can regularize **those** parameters!

Well. In retrospect that wasn’t quite so simple as it seemed.

It’s a strong bet that this is a standard interpretation in some communities.

Update: The idea of fitting multiple parameters using implicit differentiation is published in the 1996 paper “Adaptive regularization in neural network modeling” by Jan Larsen, Claus Svarer, Lars Nonboe Andersen and Lars Kai Hansen.

Hello,

I just happened to see your website where you explain regularization. Very well explained; there is just a small detail that I think is missing: how to apply lambda back as input.

$$M({\bf w}) + \lambda R({\bf w})$$

From the equation you have given, lambda is just a multiplier. You can choose any arbitrary number and the output may not change, or am I missing something?

Hello,

I’m not sure I understand your question, but in general it is not obvious exactly how to choose the value of lambda. I suppose that the standard “best practice” would be to try a range of different lambdas and see which leads to the best performance on a validation set (or, better, to do this using cross-validation).

This was awesome, thank you!