You drew 40 random cells from a sample and found that a new drug affected 16 of them. An online calculator told you:

“With 90% confidence, the true fraction is between 26.9% and 54.2%.”

But what does this *really* mean? We’ve talked about confidence *sets* before—read that post first if you find this one too difficult. Now let’s talk about intervals.

## Confidence intervals

Years from now, all Generation β does is sit around meditating on probability theory and reading Ars Conjectandi. You work at a carnival where one day your boss says, “*Our new ride is now so popular that we only have capacity for 10% of guests. I’ve got some bent coins of different colors, and I want you to create a guessing game to ration access to the ride.*”

Here’s how the game is supposed to work:

- The guest will pick one coin, flip it, and announce the outcome.
- Based on that outcome, you guess some set of colors.
- The true color of the coin is revealed. If it’s *not* in the set of colors, the guest can go on the ride.

It’s essential that no matter what the guests do, only 10% of them can win the game. But otherwise, you’d like to guess as few colors as possible to better impress the players.

You’re given five bent coins. Your boss has CT-scanned them and run exhaustive simulations to find the true probability that each will come up heads.

Coin | Prob. tails | Prob. heads |
---|---|---|
red | .9 | .1 |
green | .7 | .3 |
blue | .5 | .5 |
yellow | .3 | .7 |
white | .1 | .9 |

Your first idea for a game is the obvious one: Flip the coin once, and try to guess the color based on whether it comes up heads or tails. While you could do that, it would be extremely boring.

Then, you have another idea. Why not flip the coin twice? Take the red coin. It’s easy to compute the probability of getting different numbers of heads:

- 0 heads: (0.9)² = 0.81. (The probability of getting tails twice in a row.)
- 1 head: 0.1 × 0.9 + 0.9 × 0.1 = 0.18. (The probability of heads then tails, plus the probability of tails then heads.)
- 2 heads: (0.1)² = 0.01. (The probability of getting heads twice in a row.)

Continuing this way, you make a table of the probability of getting a *total* number of heads after two coin flips for each of the coins:

Coin | 0 heads | 1 head | 2 heads |
---|---|---|---|
red | .81 | .18 | .01 |
green | .49 | .42 | .09 |
blue | .25 | .50 | .25 |
yellow | .09 | .42 | .49 |
white | .01 | .18 | .81 |
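The two-flip table can be double-checked by brute force. The following is a minimal sketch that lists every possible sequence of flips and adds up the probabilities by number of heads. The function name `heads_distribution` is made up for illustration.

```python
from itertools import product

def heads_distribution(prob_heads, num_flips):
    """Probability of each total number of heads, found by listing every sequence."""
    dist = [0.0] * (num_flips + 1)
    for flips in product("HT", repeat=num_flips):
        # Probability of this exact sequence: multiply the per-flip probabilities.
        p = 1.0
        for flip in flips:
            p *= prob_heads if flip == "H" else 1 - prob_heads
        dist[flips.count("H")] += p
    return dist

coins = {"red": 0.1, "green": 0.3, "blue": 0.5, "yellow": 0.7, "white": 0.9}
two_flip_table = {color: heads_distribution(p, 2) for color, p in coins.items()}

for color, dist in two_flip_table.items():
    print(color, [round(x, 2) for x in dist])
```

Listing every sequence only works for a small number of flips, since the number of sequences doubles with each extra flip.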

You could make a game based on two coinflips, but why not spice things up even more by flipping the coin, say, 5 times? You do a little bit of research, and you discover that the probability of getting a total of **tot-heads** heads after doing **num-flips** flips of a coin with a bias of **prob** is called a Binomial, namely **Binomial(tot-heads | num-flips, prob)**. For example, the probabilities we calculated above are **Binomial(0 | 2, 0.1)=0.81**, **Binomial(1 | 2, 0.1)=0.18**, and **Binomial(2 | 2, 0.1)=0.01**. If **num-flips** is larger than 2 the math gets more complicated, but who cares? You find some code that can compute Binomial probabilities, and you use it to create the following table of the probability of getting each total number of heads after 5 coinflips:

Coin | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
red | .59049 | .32805 | .07290 | .00810 | .00045 | .00001 |
green | .16807 | .36015 | .30870 | .13230 | .02835 | .00243 |
blue | .03125 | .15625 | .31250 | .31250 | .15625 | .03125 |
yellow | .00243 | .02835 | .13230 | .30870 | .36015 | .16807 |
white | .00001 | .00045 | .00810 | .07290 | .32805 | .59049 |
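Code for Binomial probabilities can be very short. Here is a sketch using only Python’s standard library. The function name `binomial` and its argument order are chosen to mirror the **Binomial(tot-heads | num-flips, prob)** notation, and are not taken from any particular library.

```python
from math import comb

def binomial(tot_heads, num_flips, prob):
    """Probability of tot_heads heads in num_flips flips of a coin with bias prob."""
    # comb counts the orderings; each ordering has the same probability.
    return comb(num_flips, tot_heads) * prob**tot_heads * (1 - prob) ** (num_flips - tot_heads)

coins = {"red": 0.1, "green": 0.3, "blue": 0.5, "yellow": 0.7, "white": 0.9}
num_flips = 5
five_flip_table = {
    color: [binomial(k, num_flips, p) for k in range(num_flips + 1)]
    for color, p in coins.items()
}

for color, row in five_flip_table.items():
    print(color, " ".join(f"{x:.5f}" for x in row))
```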

This seems to make sense: The most likely outcome for the red coin is all tails, since the red coin is rarely heads. The blue coin’s outcomes are spread symmetrically around the middle, with 2 and 3 heads most likely, while the most likely outcome for the white coin is all heads.

Now, what colors should you guess for each outcome? Again, you need to make sure that, no matter what color the guest chooses, you will include that color with at least 90% probability. This is equivalent to covering at least .9 of the probability in each row. You decide to go about this in a greedy way. For each row, you add entries from largest to smallest until you get a total of at least 0.9. If you do that, you get this result:

Coin | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
red | **.59049** | **.32805** | .07290 | .00810 | .00045 | .00001 |
green | **.16807** | **.36015** | **.30870** | **.13230** | .02835 | .00243 |
blue | .03125 | **.15625** | **.31250** | **.31250** | **.15625** | .03125 |
yellow | .00243 | .02835 | **.13230** | **.30870** | **.36015** | **.16807** |
white | .00001 | .00045 | .00810 | .07290 | **.32805** | **.59049** |

The entries you include are shown in bold.

This corresponds to the following confidence sets:

Outcome | What you guess |
---|---|
0 | {red, green} |
1 | {red, green, blue} |
2 | {green, blue, yellow} |
3 | {green, blue, yellow} |
4 | {blue, yellow, white} |
5 | {yellow, white} |
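The greedy procedure, and the step of reading the table column by column to get the guesses, can be sketched as follows. The helper names (`binomial`, `greedy_outcomes`) are made up for illustration, and breaking ties by taking the lower outcome first is an arbitrary choice that the description above does not specify.

```python
from math import comb

def binomial(tot_heads, num_flips, prob):
    """Probability of tot_heads heads in num_flips flips of a coin with bias prob."""
    return comb(num_flips, tot_heads) * prob**tot_heads * (1 - prob) ** (num_flips - tot_heads)

def greedy_outcomes(prob, num_flips, target=0.9):
    """Outcomes covered for one coin: add entries from largest to smallest
    until the running total reaches the target."""
    probs = [binomial(k, num_flips, prob) for k in range(num_flips + 1)]
    # sorted() is stable, so tied outcomes are taken lowest-first (arbitrary choice).
    order = sorted(range(num_flips + 1), key=lambda k: probs[k], reverse=True)
    covered, total = set(), 0.0
    for k in order:
        if total >= target:
            break
        covered.add(k)
        total += probs[k]
    return covered

coins = {"red": 0.1, "green": 0.3, "blue": 0.5, "yellow": 0.7, "white": 0.9}
num_flips = 5
covered = {color: greedy_outcomes(p, num_flips) for color, p in coins.items()}

# Read the table by column: for each outcome, guess every coin whose row covers it.
guesses = {k: {c for c in coins if k in covered[c]} for k in range(num_flips + 1)}

# Coverage: for each coin, the probability that the guess contains its true color.
coverage = {c: sum(binomial(k, num_flips, coins[c]) for k in covered[c]) for c in coins}

for k, colors in guesses.items():
    print(k, sorted(colors))
for color, cov in coverage.items():
    print(color, round(cov, 5))
```

Every coverage value comes out at 0.9 or more, which is the property the game needs: whichever coin a guest picks, the guess contains its color at least 90% of the time.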

Remember what we stressed last time: When we get 4 heads and say we are “90% confident” the color is blue, yellow, or white, we don’t mean “90% probability”. We just mean that your guessing procedure will work for at least 90% of guests. After seeing 4 heads, the probability that the coin is one of those colors is, according to the worldview of confidence intervals, meaningless, because the color is a fixed quantity. And even if you’re willing to talk about probabilities in such situations, the probability could be much higher or lower than 90%.
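To see how far the probability can drift from 90%, suppose you are willing to assume how often guests pick each coin. The sketch below applies Bayes’ rule under two made-up assumptions, both hypothetical and not part of the game as described: guests pick uniformly at random, or guests almost always pick red. The guess for 4 heads is copied from the table above.

```python
from math import comb

def binomial(tot_heads, num_flips, prob):
    """Probability of tot_heads heads in num_flips flips of a coin with bias prob."""
    return comb(num_flips, tot_heads) * prob**tot_heads * (1 - prob) ** (num_flips - tot_heads)

coins = {"red": 0.1, "green": 0.3, "blue": 0.5, "yellow": 0.7, "white": 0.9}
guess_for_4_heads = {"blue", "yellow", "white"}

def prob_color_in_guess(pick_probs, tot_heads=4, num_flips=5):
    """Probability that the true color is in the guess, given the outcome and
    an ASSUMED probability that guests pick each coin."""
    joint = {c: pick_probs[c] * binomial(tot_heads, num_flips, p) for c, p in coins.items()}
    return sum(joint[c] for c in guess_for_4_heads) / sum(joint.values())

# Hypothetical assumption 1: guests pick each coin equally often.
uniform = {c: 0.2 for c in coins}
# Hypothetical assumption 2: guests almost always pick red.
mostly_red = {"red": 0.996, "green": 0.001, "blue": 0.001, "yellow": 0.001, "white": 0.001}

p_uniform = prob_color_in_guess(uniform)
p_mostly_red = prob_color_in_guess(mostly_red)
print(round(p_uniform, 3), round(p_mostly_red, 3))
```

Under the first assumption the probability is roughly 97%, and under the second it is roughly 64%. The guessing procedure is identical in both cases and still works for at least 90% of guests overall.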

It occurs to you that you can visualize this as a heatmap, with lighter colors representing higher probabilities. The entries in the following figure are laid out in the same way as the above table. Remember that red has a .1 probability of being heads so it is in the first row, green has a .3 probability so it’s in the second row, etc.

You can visualize the confidence sets by drawing an outline around the coin/outcome pairs that are included in your strategy.

Things go well for a while, but then your boss comes around again and says “*People want more of a challenge!*” You’re given 19 coins with each of the probabilities .05, .10, .15, …, .95 and told to increase the number of coin flips from 5 to 40.

At this point, it would be tedious to look at tables of numbers, but you can still visualize things:

You can use the same greedy strategy of including elements from each row until you get a sum of at least 0.9. If you do that, this is what you end up covering:

Notice: For any given outcome, the coins that you include are always next to each other. This happens because, for each coin, the probability has a single mode around some outcome, and the location of this mode shifts smoothly as the bias of the coin changes. This is why we can talk about confidence *intervals* rather than confidence *sets*: The math happens to work out in such a way that the included coins are always next to each other.
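One way to check the “next to each other” observation without a figure is to run the greedy strategy for the 19 coins and 40 flips and test each column. This is a sketch with illustrative helper names. Coins are labeled by integer percentages to avoid floating-point labels, and ties are broken lowest-outcome-first, which is an arbitrary choice.

```python
from math import comb

def binomial(tot_heads, num_flips, prob):
    """Probability of tot_heads heads in num_flips flips of a coin with bias prob."""
    return comb(num_flips, tot_heads) * prob**tot_heads * (1 - prob) ** (num_flips - tot_heads)

def greedy_outcomes(prob, num_flips, target=0.9):
    """Outcomes covered for one coin, adding entries from largest to smallest."""
    probs = [binomial(k, num_flips, prob) for k in range(num_flips + 1)]
    order = sorted(range(num_flips + 1), key=lambda k: probs[k], reverse=True)
    covered, total = set(), 0.0
    for k in order:
        if total >= target:
            break
        covered.add(k)
        total += probs[k]
    return covered

num_flips = 40
percents = range(5, 100, 5)  # coins with biases 5%, 10%, ..., 95%
covered = {pc: greedy_outcomes(pc / 100, num_flips) for pc in percents}

# For each outcome, the coins (as percentages, in increasing order) that you guess.
guesses = {k: [pc for pc in percents if k in covered[pc]] for k in range(num_flips + 1)}

def is_contiguous(included):
    """True if the included coins are neighbors on the 5% grid (or the list is empty)."""
    return all(b - a == 5 for a, b in zip(included, included[1:]))

for k in (0, 16, 40):
    print(k, guesses[k], is_contiguous(guesses[k]))
print("all columns contiguous:", all(is_contiguous(g) for g in guesses.values()))
```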

Finally, your boss suggests one last change. The game should work this way:

- Each guest is given a soft-metal coin, which they can bend into whatever shape they want.
- That coin is flipped 40 times, and the outcome is announced.
- You need to guess some *interval* that hopefully contains the true bias of the coin.
- The coin is CT-scanned, and the carnival’s compute cluster finds the true bias. If it’s *not* in the interval you guessed, the guest can go on the ride.

Thinking about how to address this game, it occurs to you that you can make figures in the same way with any number of coins, and if you use a fine enough grid, you will cover all possibilities. The following figure shows what you get if you use the same process with 1001 coins with biases 0.000, 0.001, 0.002, …, 1.000.

Now, remember where we started: We tested 40 cells and found that 16 of them had changed, and an online calculator told us that with 90% confidence the true fraction was between 26.9% and 54.2%.

To understand where these numbers come from, just take this figure and put a vertical line at # heads = 16:

What’s included is all the coins with biases between .269 and .542. *That’s* why the confidence interval for 16 out of 40 is 26.9% to 54.2%.
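Reading off that vertical line can also be done in code. The sketch below applies the greedy rule to the 1001-coin grid and reports the smallest and largest bias whose row covers 16 heads. Helper names are illustrative, and ties are broken lowest-outcome-first (an arbitrary choice). Its endpoints should land close to the calculator’s numbers. They need not agree to the last digit, because the result depends on such details and on the exact rule a given calculator uses, which this sketch makes no attempt to replicate.

```python
from math import comb

def binomial(tot_heads, num_flips, prob):
    """Probability of tot_heads heads in num_flips flips of a coin with bias prob."""
    return comb(num_flips, tot_heads) * prob**tot_heads * (1 - prob) ** (num_flips - tot_heads)

def greedy_outcomes(prob, num_flips, target=0.9):
    """Outcomes covered for one coin, adding entries from largest to smallest."""
    probs = [binomial(k, num_flips, prob) for k in range(num_flips + 1)]
    order = sorted(range(num_flips + 1), key=lambda k: probs[k], reverse=True)
    covered, total = set(), 0.0
    for k in order:
        if total >= target:
            break
        covered.add(k)
        total += probs[k]
    return covered

num_flips, observed = 40, 16

# Biases 0.000, 0.001, ..., 1.000 whose greedy row includes the observed outcome.
included = [m / 1000 for m in range(1001) if observed in greedy_outcomes(m / 1000, num_flips)]
lower, upper = min(included), max(included)

print(f"{lower:.3f} to {upper:.3f}")
```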