More confidence games

You drew 40 random cells from a sample and found that a new drug affected 16 of them. An online calculator told you:

“With 90% confidence, the true fraction is between 26.9% and 54.2%.”

But what does this really mean? We’ve talked about confidence sets before—read that post first if you find this one too difficult. Now let’s talk about intervals.

Confidence intervals

Years from now, all Generation β does is sit around meditating on probability theory and reading Ars Conjectandi. You work at a carnival where one day your boss says, “Our new ride is now so popular that we only have capacity for 10% of guests. I’ve got some bent coins of different colors, and I want you to create a guessing game to ration access to the ride.

Here’s how the game is supposed to work:

  1. The guest will pick one coin, flip it, and announce the outcome.
  2. Based on that outcome, you guess some set of colors.
  3. The true color of the coin is revealed. If it’s not in the set of colors, the guest can go on the ride.

It’s essential that no matter what the guests do, only 10% of them can win the game. But otherwise, you’d like to guess as few colors as possible to better impress the players.

You’re given five bent coins, which your boss has CT-scanned and run exhaustive simulations to find the true probability each will come up heads.

Coin Prob. tails Prob. heads
red .9 .1
green .7 .3
blue .5 .5
yellow .3 .7
white .1 .9

Your first idea for a game is the obvious one: Flip it, and try to guess the color based on the outcome of heads or tails. While you could do that it would be extremely boring.

Then, you have another idea. Why not flip the coin twice? Take the red coin. It’s easy to compute the probability of getting different numbers of heads:

  • 0 heads: (0.9)2 = 0.81 (The probability of rolling tails twice in a row.)
  • 1 head: 0.1 × 0.9 + 0.9 × 0.1 = 0.18. (The probability of either rolling heads × tails plus the probability of rolling tails × heads.)
  • 2 heads: (0.1)2 = 0.01 (The probability of rolling heads twice in a row.)

Continuing this way, you make a table of the probability of getting a total number of heads after two coin flips for each of the coins:

0 heads 1 head 2 heads
red .81 .18 .01
green .49 .42 .09
blue .25 .50 .25
yellow .09 .42 .49
white .01 .18 .81

You could make a game based on two coinflips, but why not make things even more interesting? Why not spice things up even more by flipping the coin, say, 5 times. You do a little bit of research, and you discover that the probability of getting a total of tot-heads heads after doing num-flips flips of a coin with a bias of prob is called a Binomial, namely Binomial(tot-heads | num-flips, prob). For example, the probabilities we calculated above are Binomial(0 | 2, 0.1)=0.81, Binomial(1 | 2, 0.1)=0.18, and Binomial(1 | 2, 0.1)=0.01. If num-flips is larger than 2 the math gets more complicated, but who cares? You find some code that can compute Binomial probabilities, and you use it to create the following table of the probability of getting each total number of heads after 5 coinflips:

0 1 2 3 4 5
red .59049 .32805 .07290 .00810 .00045 .00001
green .16807 .36015 .30870 .13230 .02835 .00243
blue .03125 .15625 .31250 .31250 .15625 .03125
yellow .00243 .02835 .13230 .30870 .36015 .16807
white .00001 .00045 .00810 .07290 .32805 .59049

This seems to make sense: The most likely outcome for the red coin is all tails, since the red coin is rarely heads. The most likely outcomes for the blue coin are nearly evenly distributed, while the most likely outcome for the white coin is all heads.

Now, what colors should you guess for each outcome? Again, you need to make sure that, no matter what color the guest chooses, you will include that color with 90% probability. This is equivalent to covering .9 of the probability from each row. You decide to go about this in a greedy way. For each row, you add entries from largest to smallest until you get a total that’s above 0.9. If you do that, you get this result:

0 1 2 3 4 5
red .59049 .32805 .07290 .00810 .00045 .00001
green .16807 .36015 .30870 .13230 .02835 .00243
blue .03125 .15625 .31250 .31250 .15625 .03125
yellow .00243 .02835 .13230 .30870 .36015 .16807
white .00001 .00045 .00810 .07290 .32805 .59049

This corresponds to the following confidence sets:

Outcome What you guess
0 {red, green}
1 {red, green, blue}
2 {green, blue, yellow}
3 {green, blue, yellow}
4 {blue, yellow, white}
5 {yellow, white}

Remember what we stressed last time: When we get 4 heads and say we are “90% confident” the color is blue, yellow, or white, we don’t mean “90% probability”, we just mean that your guessing procedure will work for 90% of guests. After seeing 4 heads, the probability of given colors is—according to the worldview of confidence intervals—meaningless because it’s a fixed quantity. And even if you’re willing to talk about probabilities in such situations, the probability could be much higher or lower than 90%.

It occurs to you that you can visualize this as a heatmap, with lighter colors representing higher probabilities. The entries in the following figure are laid out in the same way as the above table. Remember that red has a .1 probability of .being heads so it is in the first row, green has a .3 probability so it’s in the second row, etc.

You can visualize the confidence sets by drawing an outline around the coin/outcome pairs that are included in your strategy.

Things go well for a while, but then your boss comes around again and says “People want more of a challenge!” You’re given 19 coins with each of the probabilities .05, .10, .15, …, .95 and told to increase the number of coin flips from 5 to 40.

At this point, it would be tedious to look at tables of numbers, but you can still visualize things:

You can use the same greedy strategy of including elements from each row until you get a sum of 0.9. If you do that, this is what you end up covering:

Notice: For any given outcome, the set of coins that you include are always next to each other. This happens just because for each coin, there’s a single mode of probability around a given outcome, and the location of this mode changes smoothly as the bias of the coin changes. This is why we can talk about confidence intervals rather than confidence sets: The math happens to work out in such a way that the included coins are always next to each other.

Finally, your boss suggests one last change. The game should work this way:

  1. Each guest is given a soft-metal coin, which they can bend into whatever shape they want.
  2. That coin is flipped 40 times, and the outcome is announced.
  3. You need to guess some interval that hopefully contains the true bias of the coin.
  4. The coin is CT-scanned, and the carnival’s compute cluster finds the true bias. If it’s not in the interval you guessed, the guest can go on the ride.

Thinking about how to address this came, it occurs to you that you can make figures in the same way with any number of coins, and if you use a fine enough grid, you will cover all possibilities. The following figure shows what you get if you use the same process with 1001 coins ranging from 0.000, 0.001, 0.002, …, 1.000.

Now, remember, where we started: We tested 40 cells and found that 16 of them had changed, and an online calculator told us that with 90% confidence the true fraction was between 26.9% and 54.2%.

To understand where these numbers come from, just take this figure and put a vertical line at # heads = 16:

What’s included is all the coins with biases between .269 and .542. That’s why the confidence interval for 16 out of 40 is 26.9% to 54.2%.

Statistics – the rules of the game

What is statistics about, really? It’s easy to go through a class and get the impression that it’s about manipulating intimidating formulas. But what’s the goal of them? Why did people invent them?

If you zoom out, the big picture is more conceptual than mathematical. Statistics has a crazy, grasping ambition: it wants to tell you how to best use observations to make decisions. For example, you might look at how much it rained each day in the last week, and decide if you should bring an umbrella today. Statistics converts data into ideal actions.

Here, I’ll try to explain this view. I think it’s possible to be quite precise about this while using almost no statistics background and extremely minimal math.

The two important characters that we meet are decision rules and loss functions. Informally, a decision rule is just some procedure that looks at a dataset and makes a choice. A loss function — a basic concept from decision theory– is a precise description of “how bad” a given choice is.

Model Problem: Coinflips

Let’s say you’re confronted with a coin where the odds of heads and tails are not known ahead of time. Still, you are allowed to observe how the coin performs over a number of flips. After that, you’ll need to make a “decision” about the coin. Explicitly:

  • You’ve got a coin, which comes up heads with probability w. You don’t know w.
  • You flip the coin n times.
  • You see k heads and n-k tails.
  • You do something, depending on k. (We’ll come back to this.)

Simple enough, right? Remember, k is the total number of heads after n flips. If you do some math, you can work out a formula for p(k\vert w,n): the probability of seeing exactly k heads. For our purposes, it doesn’t really matter what that formula is, just that it exists. It’s known as a Binomial distribution, and so is sometimes written \mathrm{Binomial}(k\vert n,w).

Here’s an example of what this looks like with n=21 and w=0.5.


Naturally enough, if w=.5, with 21 flips, you tend to see around 10-11 heads. Here’s an example with w=0.2. Here, the most common value is 4, close to 21\times.2=4.2.


Decisions, decisions

After observing some coin flips, what do we do next? You can imagine facing various possible situations, but we will use the following:

Our situation: After observing n coin flips, you need to guess “heads” or “tails”, for one final coin flip.

Here, you just need to “decide” what the next flip will be. You could face many other decisions, e.g. guessing the true value of w.

Now, suppose that you have a friend who seems very skilled at predicting the final coinflip. What information would you need to reproduce your friend’s skill? All you need to know is if your friend predicts heads or tails for each possible value of k. We think of this as a decision rule, which we write abstractly as


This is just a function of one integer k. You can think of this as just a list of what guess to make, for each possible observation, for example:

k \mathrm{Dec}(k)
0 \mathrm{tails}
1 \mathrm{heads}
2 \mathrm{heads}
\vdots \vdots
n \mathrm{tails}

One simple decision rule would be to just predict heads if you saw more heads than tails, i.e. to use

\mathrm{Dec}(k)=\begin{cases} \mathrm{heads}, & k\geq n/2 \\ \mathrm{tails} & k<n/2 \end{cases}.

The goal of statistics is to find the best decision rule, or at least a good one. The rule above is intuitive, but not necessarily the best. And… wait a second… what does it even mean for one decision rule to be “better” than another?

Our goal: minimize the thing that’s designed to be minimized

What happens after you make a prediction? Consider our running example. There are many possibilities, but here are two of the simplest:

  • Loss A: If you predicted wrong, you lose a dollar. If you predicted correctly, nothing happens.
  • Loss B: If you predict “tails” and “heads” comes up, you lose 10 dollars. If you predict “heads” and “tails” comes up, you lose 1 dollar. If you predict correctly, nothing happens.

We abstract these through a concept of a loss function. We write this as


The first input is the true (unknown) value w, while second input is the “prediction” you made. We want the loss to be small.

Now, one point might be confusing. We defined our situation as predicting the next coinflip, but now L is defined comparing d to w, not to a new coinflip. We do this because comparing to w gives the most generality. To deal with our situation, just use the average amount of money you’d lose if the true value of the coin were w. Take loss A. If you predict “tails”, you’ll be wrong with probability w, while if you predict “heads”, you’ll be wrong with probability 1-w, and so lose 1-w dollars on average. This leads to the loss

L_{A}(w,d)=\begin{cases} w & d=\mathrm{tails}\\ 1-w & d=\mathrm{heads} \end{cases}.

For loss B, the situation is slightly different, in that you lose 10 times as much in the first case. Thus, the loss is

L_{B}(w,d)=\begin{cases} 10w & d=\mathrm{tails}\\ 1-w & d=\mathrm{heads} \end{cases}.

The definition of a loss function might feel circular– we minimize the loss because we defined the loss as the thing that we want to minimize. What’s going on? Well, a statistical problem has two separate parts: a model of the data generating process, and a loss function describing your goals. Neither of these things is determined by the other.

So, the loss function is part of the problem. Statistics wants to give you what you want. But you need to tell statistics what that is.

Despite the name, a “loss” can be negative– you still just want to minimize it. Machine learning, always optimistic, favors “reward” functions that are to be maximized. Plus ça change.

Model + Loss = Risk

OK! So, we’ve got a model of our data generating process, and we specified some loss function. For a given w, we know the distribution over k, so… I guess… we want to minimize it?

Let’s define the risk to be the average loss that a decision rule gives for a particular value of w. That is,

R(w,\mathrm{Dec})=\sum_{k}p(k\vert w,n)L(w,\mathrm{Dec}(k)).

Here, the second input to R is a decision rule– a precise recipe of what decision to make in each possible situation.

Let’s visualize this. As a set of possible decision rules, I will just consider rules that predict “heads” if they’ve seen at least m heads, and “tails” otherwise:

\mathrm{Dec}_{m}(k)=\begin{cases} \mathrm{heads} & k\geq m\\ \mathrm{tails} & k<m \end{cases}.

With n=21 there are 22 such decision rules, corresponding to m=0, (always predict heads), m=1 (predict heads if you see at least one heads), up to m=22 (always predict tails). These are shown here:

These rules are intuitive: if you’d predict heads after observing 16 heads out of 21, it would be odd to predict tails after seeing 17 instead! It’s true that for losses L_{A} and L_{B}, you don’t lose anything by restricting to this kind of decision rule. However, there are losses for which these decision rules are not enough. (Imagine you lose more when your guess is correct.)

With those decision rules in place, we can visualize what risk looks like. Here, I fix w=0.2, and I sweep through all the decision rules (by changing m) with loss L_{A}:


The value R_A in the bottom plot is the total area of the green bars in the middle. You can do the same sweep for w=0.4, which you can is pictured here:


We can visualize the risk in one figure with various w and m. Notice that the curves for w=0.2 and w=0.4 are exactly the same as we saw above.


Of course, we get a different risk depending on what loss function we use. If we repeat the whole process using loss L_{B} we get the following:


Dealing with risk

What’s the point of risk? It tells us how good a decision rule is. We want a decision rule where risk is as low as possible. So you might ask, why not just choose the decision rule that minimizes R(w,\mathrm{Dec})?

The answer is: because we don’t know w! How do we deal with that? Believe it or not, there isn’t a single well-agreed upon “right” thing to do, and so we meet two different schools of thought.

Option 1 : All probability all the time

Bayesian statistics (don’t ask about the name) defines a “prior” distribution p(w) over w. This says which values of w we think are more and less likely. Then, we define the Bayesian risk as the average of R over the prior:


This just amounts to “averaging” over all the risk curves, weighted by how “probable” we think w is. Here’s the Bayes risk corresponding to L_{A} with a uniform prior p(w)=1:


For reference, the risk curves R(w,\mathrm{Dec}_m) are shown in light grey. Naturally enough, for each value of m, the Bayes risk is just the average of the regular risks for each w.

Here’s the risk corresponding to L_{B}:


That’s all quite natural. But we haven’t really searched through all the decision rules, only the simple ones \mathrm{Dec}_m. For other losses, these simple ones might not be enough, and there are a lot of decision rules. (Even for this toy problem there are 2^{22}, since you can output heads or tails for each of k=0, k=1, …, k=21.)

Fortunately, we can get a formula for the best decision rule for any loss. First, re-write the Bayes risk as

R_{\mathrm{Bayes}}(\mathrm{Dec})=\sum_{k} \left( \int_{w=0}^{1}p(w)p(k\vert n,w)L(w,\mathrm{Dec}(k))dw \right).

This is a sum over k where each term only depends on a single value \mathrm{Dec}(k). So, we just need to make the best decision for each individual value of k separately. This leads to the Bayes-optimal decision rule of

\mathrm{Dec}_{\text{Bayes}}(k)=\arg\min_{d}\int_{w=0}^{1}p(w)p(k\vert w,n)L(w,d)dw.

With a uniform prior p(w)=1, here’s the optimal Bayesian decision rules with loss L_{A}:


And here it is for loss L_B:


Look at that! Just mechanically plugging the loss function into the Bayes-optimal decision rule naturally gives us the behavior we expected– for L_{B}, the rule is very hesitant to predict tails, since the loss is so high if you’re wrong. (Again, these happen to fit in the parameterized family \mathrm{Dec}_{m} defined above, but we didn’t use this assumption in deriving the rules.)

The nice thing about the Bayesian approach is that it’s so systematic. No creativity or cleverness is required. If you specify the data generating process (p(k\vert w,n)) the loss function (L(w,d)) and the prior distribution (p(w)) then the optimal Bayesian decision rule is determined.

There are some disadvantages as well:

  • Firstly, you need to make up the prior, and if you do a terrible job, you’ll get a poor decision rule. If you have little prior knowledge, this can feel incredibly arbitrary. (Imagine you’re trying to estimate Big G.) Different people can have different priors, and then get different results.
  • Actually computing the decision rule requires doing an integral over w, which can be tricky in practice.
  • Even if your prior is good, the decision rule is only optimal when averaged over the prior. Suppose, for every day for the next 10,000 years, a random coin is created with w drawn from p(w). Then, no decision rule will incur less loss than \mathrm{Dec}_{\text{Bayes}}. However, on any particular day, some other decision rule could certainly be better.

So, if you have little idea of your prior, and/or you’re only making a single decision, you might not find much comfort in the Bayesian guarantee.

Some argue that these aren’t really disadvantages. Prediction is impossible without some assumptions, and priors are upfront and explicit. And no method can be optimal for every single day. If you just can’t handle that the risk isn’t optimal for each individual trial, then… maybe go for a walk or something?

Option 2 : Be pessimistic

Frequentist statistics (Why “frequentist”? Don’t think about it!) often takes a different path. Instead of defining a prior over w, let’s take a worst-case view. Let’s define the worst-case risk as


Then, we’d like to choose an estimator to minimize the worst-case risk. We call this a “minimax” estimator since we minimize the max (worst-case) risk.

Let’s visualize this with our running example and L_{A}:


As you can see, for each individual decision rule, it searches over the space of parameters w to find the worst case. We can visualize the risk with L_{B} as:


What’s the corresponding minimax decision rule? This is a little tricky to deal with– to see why, let’s expand the worst-case risk a bit more:

R_{\mathrm{Worst}}(\mathrm{Dec})=\max_{w}\sum_{k}p(k\vert n,w)L(w,\mathrm{Dec}(k)).

Unfortunately, we can’t interchange the max and the sum, like we did with the integral and the sum for Bayesian decision rules. This makes it more difficult to write down a closed-form solution. At least in this case, we can still find the best decision rule by searching over our simple rules \mathrm{Dec}_m. But be very mindful that this doesn’t work in general!

For L_{A} we end up with the same decision rule as when minimizing Bayesian risk:


For L_{B}, meanwhile, we get something slightly different:


This is even more conservative than the Bayesian decision rule. \mathrm{Dec}_{B-\mathrm{Bayes}}(2)=\mathrm{tails}, while \mathrm{Dec}_{B-\mathrm{minimax}}(2)=\mathrm{heads}. That is, the Bayesian method predicts heads when it observes 2 or more, while the minimax rule predicts heads if it observes even one. This makes sense intuitively: The minimax decision rule proceeds as if the “worst” w (a small number) is fixed, whereas the Bayesian decision rule less pessimistically averages over all w.

Which decision rule will work better? Well, if w happens to be near the worst-case value, the minimax rule will be better. If you repeat the whole experiment many times with w drawn from the prior, the Bayesian decision rule will be.

If you do the experiment at some w far from the worst-case value, or you repeat the experiment many times with w drawn from a distribution different from your prior, then you have no guarantees.

Neither approach is “better” than the other, they just provide different guarantees. You need to choose what guarantee you want. (You can kind of think of this as a “meta” loss.)

So what about all those formulas, then?

For real problems, the data generating process is usually much more complex than a Binomial. The “decision” is usually more complex than predicting a coinflip– the most common decision is making a guess for the value of w. Even calculating R(w,\mathrm{Dec}) for fixed w and \mathrm{Dec} is often computationally hard, since you need to integrate over all possible observations. In general, finding exact Bayes or minimax optimal decision rules is a huge computational challenge, and at least some degree of approximation is required. That’s the game, that’s why statistics is hard. Still, even for complex situations the rules are the same– you win by finding a decision rule with low risk.

Fitting an inference algorithm instead of a model

One recent trend seems to be the realization that one can get better performance by tuning a CRF (Conditional Random Field) to a particular inference algorithm. Basically, forget about the distribution that the CRF represents, and instead only care how accurate are the results that pop out of inference. An extreme example of this is the recent paper Learning Real-Time MRF Inference for Image Denoising by Adrian Barbu.

The basic idea is to fit a FoE (Field of Experts) image prior such that when one takes a very few gradient descent steps on a denoising posterior, the results are accurate. From the abstract:

We argue that through appropriate training, a MRF/CRF model can be trained to perform very well on a suboptimal inference algorithm. The model is trained together with a fast inference algorithm through an optimization of a loss function […] We apply the proposed method to an image denoising application […] with a 1-4 iteration gradient descent inference algorithm. […] the proposed training approach obtains an improved benchmark performance as well as a 1000-3000 times speedup compared to the Fields of Experts MRF.

The implausible-sounding 1000-fold speedup comes simply from using only 4 iterations of gradient descent rather than several thousand. (Incidentally, L-BFGS requires far fewer iterations for this problem.) The results are a bit better than the generative FoE model– that takes much more work for training and inference.

I have every confidence that this does work well, and similar strategies could probably be used to create fast inference models/algorithms for many different problems. My thesis was largely an attempt to do the same thing for marginal, rather than MAP inference.

The disturbing/embarrassing question, for me, is does this really have anything to do with probabilistic modeling any more? Formally speaking, a probability density is being fit, but I doubt it would transfer to, say, inpainting, or that samples from the density would look like natural images. The best interpretation of what is happening might be that one is simply fitting a big, nonlinear, black box function approximation.

It seems that the more effort we expend to wring the best performance out of a probabilistic model, the less “probabilistic” it is.

P.S. Some of my friends have invited me to never mention autodiff again, ever, but this is one of the many papers where I think the learning optimization would be made much easier/faster by using it.

Unbiased coinflips from biased coinflips

A old problem, due to von Neumann goes as follows:

You have a biased coin that produces heads (H) with probability p, and tails (T) with probability (1-p).You don’t know p. How can you use this coin to simulate an unbiased coin?

The next paragraph contains a solution, so if you want to solve the problem yourself, stop reading now!

von Neumann’s solution was as follows*:

  1. Flip the coin twice
  2. If the outcome is HT, output H
  3. If the outcome is TH, output T
  4. Otherwise, go to 1.

A nice solution, but you can see that you might need to flip the coin many times. In particular, the probability of getting either HH or TT in any particular round is p^2 + (1-p)^2, which could be really big for highly biased coins.

Is there a more efficient method?

Today I found a beautiful paper examining this question. The insight is that the von Neumann scheme is based on symmetry– picking pairs of output strings that have equal probability, then outputting heads for one and tails for the other.

You can draw an (infinitely large) tree, with branches corresponding to random coin flip outcomes, and leaf nodes for an output. The expected number of coin flips for an output is the expected depth in the tree that one reaches before outputting a result and quitting. Viewed in this light, one can think of ways to create more efficient algorithms. The paper contains trees illustrating this wonderfully.

* Incidentally, when starting a paragraph with the name of a person whose first name is not capitalized, should the first character of the paragraph be capitalized? Which convention takes precedence?

Markov’s Inequality

While looking at Tao’s post on the law of large numbers, I claimed to Alap that Markov’s inequality was “obvious”. Asked to back that up, though, I couldn’t remember why it was obvious. Take some distribution p(x). For simplicity, assume this distribution is over the non-negative reals. Then, Markov’s inequality states, for \lambda > 0

P[x\geq\lambda] \leq \frac{E[x]}{\lambda}.

Though I find it more intuitive in the form

\lambda P[x \geq \lambda] \leq E[x].

Now, a proof is pretty trivial, basically by writing down the definitions of both sides.

E[x] = \int_{x\geq 0} x p(x) dx

\lambda P[x \geq \lambda] = \int_{x\geq\lambda} \lambda p(x) dx \leq \int_{x\geq\lambda} x p(x) dx

\leq \int_{x\geq 0} x p(x) dx

However, I think more intuition can be conveyed with pictures.

Matlab Code:

function markov_inequality

fsize = 16;
lambda = 5;

x = 0:.01:15;
p = .5*exp(-(x-3).^2) + exp(-(x-7).^2/4);

xlim([min(x) max(x)])
ylim([0 max(x.*p)+.25])

xlim([min(x) max(x)])
ylim([0 max(x.*p)+.25])
legend('x p(x)');

xlim([min(x) max(x)])
ylim([0 max(x.*p)+.25])
hold on;
legend('x p(x)', '\lambda p(x)');

area(x,x.*p,'FaceColor',[.7 .7 .7],'LineWidth',2)
hold on;
good = find(x>=lambda);
area(x(good),p(good)*lambda,'FaceColor',[.8 0 0],'LineWidth',2)
hold off
legend('E[x]','\lambda P[x \geq \lambda]')
xlim([min(x) max(x)])
ylim([0 max(x.*p)+.25])