# You deserve better than two-sided finite differences

In calc 101, the derivative is derived as $df/dx = \lim_{\epsilon \rightarrow 0} (f(x+\epsilon)-f(x))/\epsilon$. So, if you want to estimate a derivative, an easy way to do so would be to just pick some small $\epsilon$ and estimate:

$df/dx \approx \frac{1}{\epsilon} (f(x+\epsilon)-f(x))$

This can work OK. Let’s look at an example of trying to calculate the derivative of $\log (x)$, using a range of different $\epsilon$

What’s happening? Well, the result is true mathematically in the limit that $\epsilon$ is small, so it’s natural to get errors for large $\epsilon$. However, with very small $\epsilon$ you run into trouble because floating-point arithmetic can only represent finite precision. Let’s try again with a smaller value of $x=.001$

That’s somewhat concerning. We can still get a nearly correct value, but we have a limited range of steps that achieve it. This is very well-known in numerical analysis, and a common solution is to use two-sided differences, i.e. to estimate

$df/dx \approx \frac{1}{2 \epsilon} (f(x+\epsilon)-f(x-\epsilon)).$

If we try that, we indeed get better results:

What’s happening here? Basically, we are using more information about the function to make a higher-order approximation, so we can mathematically get away with using a larger value of $\epsilon$. This in turn isis helpful to avoid the numerical precision demons.

Great! But let’s go deeper. If we use $x = 10^{-5}$, we get:

Huh. 1-sided differences totally fall apart, but we seem to be running into more trouble. But never fear, you can use “four-sided differences”!

$df/dx \approx \frac{1}{12 \epsilon} (-f(x+2\epsilon)+8 f(x+\epsilon)-8 f(x+\epsilon)+ f(x-2 \epsilon)$

Then, we get what you might expect:

But what if we go deeper, with $x=10^{-7}$?

Or maybe we should go even deeper, with $x = 10^{-9}$?

Even the four-sided differences have failed us. Now, you might take a lesson here that you shouldn’t be using numerical differences (and I sort of agree). Those of certain temperament, on the other hand, would say instead that what we need is power. OK then, how do six-sided differences sound to you? Sound good?

$df/dx \approx \frac{1}{60 \epsilon}(- f(x-3 \epsilon)+9 f(x-2 \epsilon)-45 f(x-\epsilon)+45 f(x+\epsilon)-9 f(x+2\epsilon)+f(x+3 \epsilon))$

This is still tough, but there is at least some range of epsilon where you can calculate a reasonable derivative. There is, of course, a never-ended sequence of these higher-order derivative approximations. There’s even a calculator, for all your bespoke finite-difference stencil needs.

Now, you might think that the real solution here is to use automatic differentiation, and you’re mostly right. It takes more computation to sample the function at more points, and no number of samples will fundamentally stop the numerical demons from destroying your result.

However, there still remain cases where it’s worthwhile to manually compute a derivative. More importantly perhaps, when you’re implementing an automatic differentiation tool, you still need to test that it is actually correct! Here, numerically differences will probably forever remain useful. So, certainly for the cases of building a test suite for autodiff code, it certainly makes sense to use these higher-order derivatives.

(I originally intended to leave this as a comment on Tim Viera’s post on testing gradient implementations, but this kinda got out of control.)

## 7 thoughts on “You deserve better than two-sided finite differences”

1. Abhijit Kundu says:

Very nice post. Reminds me of using Ridder’s method for iteratively refining the epsilon starting with a large step size.

2. justindomke says:

Thanks for that. Do you have a reference for Ridder’s method? I’m familiar with the use for root finding (which is clearly related) but I can’t seem to find a reference directly for estimating derivatives.

3. This is super awesome! I hate numerical demons! How did you measure the numerical accuracy to make these plots? Also, have you heard of Herbie (https://herbie.uwplse.org/)?

4. Cool post, thanks. Is there any reason not to use Chebyshev nodes for the interpolating points of the polynomial, rather than evenly-spaced? I guess this is called the “stencil”? I was under the impression Chebyshev gives the best approximation.

5. justindomke says:

I’m no expert, but I think that Chebyshev nodes are probably a good choice. That being said, they are the “best” choice only in a certain minimax sense, that doesn’t necessarily mean the best for your particular function in the metric of error that you really care about.