Is anchoring a reliable cognitive bias?

Say you get a large group of people and split them into three groups. The first group gets this question:

What is the length of the Mississippi river?

The second group gets these questions:

Is the Mississippi river greater or less than 70 miles long?

What is the length of the Mississippi river?

The third group gets these:

Is the Mississippi river greater or less than 2000 miles long?

What is the length of the Mississippi river?

When Jacowitz and Kahneman tried this in 1995, they found that the groups had median estimates of 800 miles, 300 miles, and 1500 miles, respectively.

In theory at least, the initial greater / less questions provide no information, and so shouldn’t change the final estimates people give. But they do change them, a lot.

This effect is known as anchoring. It’s widely discussed, but the replication crisis has made me paranoid about trusting anything that hasn’t been replicated. So I tried to find anchoring studies with replications.

Anchoring for estimation

A seminal paper from 1995 is Measures of Anchoring in Estimation Tasks by Jacowitz and Kahneman. They created a list of 15 quantities to estimate and took 156 U.C. Berkeley students who were taking an intro to psychology course. They had three questionnaires, one—for calibration—simply asked them to estimate everything. The other two questionnaires first asked if each quantity was more or less than a certain anchor and then asked for an estimate of the quantity.

The results were that anchoring had a massive effect:

Here, the left column shows the median among people who weren’t given an anchor, the next two columns show what numbers were used as low/high anchors, and the last two columns show the median responses among people given low/high anchors.

But does it replicate?

The Many Labs project is an effort from a team of psychologists around the world. In their first paper (Klein et al., 2014) they tried to replicate 13 different well-known effects, including anchoring. They had a large sample of 5000 people from around the world and gave them four estimation problems.

There was one small difference from the original study. Rather than being asked whether the population of Chicago was greater or less than the anchor, participants were simply told which it was.

And what are the results? You have to dig to find the raw data, but I was able to piece it together.

[Table: for each of the four statements, the low and high anchors used in the replication, and the mean estimate given each anchor.]

The effect is real. It replicates.

Effect size mystery

There is a mystery, though. In the Many Labs paper, almost all the replications find smaller effect sizes than the original papers, but anchoring stands out as having a much larger effect size:

In the additional material on anchoring, they speculate about what might be causing the effect to be larger. This is all pretty strange, because here are the raw numbers in the original paper compared to those in the replication:

Quantity Estimate with low anchor Estimate with high anchor
Babies born per day (original) 1k 40k
Babies born per day (replication) 3.2k 26.7k
Population of Chicago (original) 600k 5.05m
Population of Chicago (replication) 1.03m 3m
Height of Mt. Everest (original) 8k 42.55k
Height of Mt. Everest (replication) 11.8k 34.5k
SF to NYC distance (original) 2.6k 4k
SF to NYC distance (replication) 2.85k 3.99k

In every case, the effect is smaller, not larger.

What’s going on? After puzzling over this, I think it comes down to the definition of an “effect size”. It’s totally possible to have a smaller effect (in the normal English-language sense) whilst having a larger effect size (in the statistical sense).

One possibility is that it’s an artifact of how the effect size is calculated. They report, “The Anchoring original effect size is a mean point-biserial correlation computed across 15 different questions in a test-retest design, whereas the present replication adopted a between-subjects design with random assignments.”

Another possibility is that it comes down to the definition. Remember, the effect size is a technical term: The difference in means divided by the standard deviation of the full population. The above table shows that the difference in means is smaller in the replication. So why is the effect size larger? Well, it could be that even though the difference of means is smaller, the standard deviation in the replication is much smaller, meaning the overall ratio is larger.
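To see how this can happen, here’s a toy calculation with made-up numbers (chosen purely for illustration; they are not from either paper). The gap between the anchored means shrinks, but if the spread shrinks faster, the standardized effect size still grows:

```python
# Hypothetical numbers (not from either paper) illustrating how a smaller
# difference in means can coexist with a larger standardized effect size.

def cohens_d(mean_low, mean_high, pooled_sd):
    """Standardized effect size: difference in means over the standard deviation."""
    return (mean_high - mean_low) / pooled_sd

# Original-style scenario: a big gap between anchored means, but a huge spread.
d_original = cohens_d(mean_low=1_000, mean_high=40_000, pooled_sd=60_000)

# Replication-style scenario: a smaller gap, but a much tighter spread.
d_replication = cohens_d(mean_low=3_200, mean_high=26_700, pooled_sd=20_000)

print(d_original)     # 0.65
print(d_replication)  # 1.175
```

The mean difference dropped from 39,000 to 23,500, yet the effect size nearly doubled, because it’s a ratio.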

I can’t confirm which of these explanations is right. This document seems to include standard deviations for the replication, but the original study doesn’t seem to list them, so I’m not sure.

In any case, I don’t think it changes the takeaways:

  1. Anchoring for estimation tasks is real, and robust, and has large effects.
  2. Be careful when looking at summary statistics. The further something is from raw data, the easier it is to misinterpret.

Bayesian view

Anchoring is real, but someone determined to see the entire world in Bayesian terms might chafe at the term “bias”. This ostensibly-not-me person might argue:

“You say anchoring is a ‘cognitive bias’ because being asked ‘Is the Mississippi river greater or less than 70 miles long?’ is a question, and thus contains no information, and thus shouldn’t change anything. But are you sure about that?”

Let me give an analogy. In a recent homework assignment, I asked students to find the minima of a bunch of mathematical functions, and at the end gave this clarification:

“In some cases, the argmin might not be unique. If so, you can just give any of the minima.”

Some students asked me, “All the minima I’m finding are unique. What am I doing wrong?” I tried to wriggle out of this with, “Well, I never promised there had to be some function without a unique minimum, so that’s not necessarily a problem…” but my students weren’t fooled. My supposedly zero-information hint did in fact contain information, and they knew it.

Or say you’re abducted by aliens. Upon waking up in their ship, they say, “Quick question, Earthling, do you think we first built the Hivemind Drive on our homeworld more or less than 40 million years ago?” After you answer, they then ask, “OK, now please estimate the true age of the Drive.” Being aware of anchoring, wouldn’t you still take the 40 million number into account?

So sure, in this experiment as performed, the anchors made estimates worse. But that only happened because the anchors were extremely loose. If the experiment were repeated with anchors that were closer to the true values, they would surely make the estimates better. It’s not at all clear that being “biased” is a problem in general.

Arguably, what’s happening is simply that people had unrealistic expectations of how experimenters would choose the anchor values. And can we blame them? The true height of Mt. Everest is 29k feet. Isn’t it a little perverse to ask if it is greater than 2k feet? In Bayesian terms, maybe it was very unlikely to choose such misleading anchors according to people’s expectations of how anchors would be selected.

Here’s another clue that people might be doing Bayesian updating: The anchoring effect is larger for quantities people know less about. Anchoring has a huge effect when estimating the number of babies born per day (which most people have no idea about) but a small effect on the distance from San Francisco to New York (where people have some information). That’s exactly what Bayesian inference would predict: If your prior is more concentrated, the likelihood will have less influence on your posterior.
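A toy Gaussian model makes this concrete (this is my own sketch, not a model from any of the papers cited here): treat the anchor as a noisy observation, and the posterior mean is a precision-weighted average of the prior mean and the anchor. With a vague prior, the anchor dominates; with a concentrated prior, it barely matters:

```python
# Toy Bayesian-updating sketch (assumed model, not from the papers): the
# prior is N(prior_mean, prior_sd^2) and the anchor is a noisy observation
# with noise standard deviation obs_sd.

def posterior_mean(prior_mean, prior_sd, anchor, obs_sd):
    # Weight on the anchor: ratio of prior variance to total variance.
    w = prior_sd**2 / (prior_sd**2 + obs_sd**2)
    return (1 - w) * prior_mean + w * anchor

# Vague prior (babies born per day): the anchor drags the estimate a lot.
vague = posterior_mean(prior_mean=10_000, prior_sd=20_000, anchor=100, obs_sd=5_000)

# Concentrated prior (SF-to-NYC distance): the anchor barely moves it.
tight = posterior_mean(prior_mean=3_000, prior_sd=300, anchor=1_500, obs_sd=5_000)

print(vague)  # pulled most of the way toward the anchor
print(tight)  # stays near the prior mean
```

All the specific numbers here (prior means, standard deviations) are invented for illustration; the point is only the qualitative pattern.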

I’m sure this has been suggested many times in the past, but to my surprise, I wasn’t able to find any specific reference. The closest thing I came across was a much more complex argument that anchoring stems from bounded rationality or limited computational resources.

Irrelevant anchors

Some people have suggested that even irrelevant numbers could act as anchors and change what numbers people choose. The most well-known example is a 2003 paper by Ariely, Loewenstein, and Prelec. They took 55 MBA students and showed them descriptions of six different products—a cordless keyboard, a bottle of wine, etc. Students were asked two questions:

  1. To get the product, would you pay X dollars? (Where X was the last two digits of the student’s social security number.)
  2. What is the maximum amount you would pay for the product?

Ideally, to make sure people were honest, you’d have them actually pay for the products. For whatever reason, that wasn’t done. Instead, there was a complex mechanism where, depending on randomness and the choices people made, people might get the product. This mechanism created an incentive to be honest about how much they valued things.

It’s unclear exactly how the experimenters got the social security numbers. Most likely subjects just wrote them down before the first question.

This experiment found a strong relationship between social security numbers and the numbers students gave for the second question.

The effect was huge and statistically significant for every product.

But does it replicate?

Some replications look good. In 2010, Bergman et al. tried to replicate this study using a very similar design. They are very clear about how their anchoring works: Students first write down the last two digits of their social security number (interpreted as a price in Swedish kronor) and then state whether they would pay that much. The results were fairly similar to the original study.

In 2017, Li et al. tried to replicate this result by having college students at Nankai University (near Beijing) write the last two digits of their phone number as an anchor, so anchors ranged from ¥0 to ¥99 (around $15). The purpose of this experiment was to test if the strength of the anchoring effect could be modulated by tDCS (applying a current to the head via electrodes, yes really). But ignoring that test, here is what they found for the 30 people in the control group—or, more precisely, the 30 people who had just undergone a sham version of tDCS with no actual stimulation:

Although the p-values are all above 0.05, this still looks like a positive relationship for chocolate (Ferrero Rocher) and maybe the novel (One Hundred Years of Solitude) but not for the wine. That could be related to the fact that the true value of the wine (¥150) was more than the largest possible anchor value of ¥99.

But seriously, does it replicate?

Other replications don’t look so good. In 2012, Fudenberg, Levine, and Maniadis did one using 79 students from UCLA. One difference was that, rather than using social security numbers, the replication rolled two dice (in front of the participants) to create anchor values. A second difference was that participants gave prices they would sell the product for, rather than prices they would pay to buy it. (With another complex mechanism.) They found nothing.

Another experiment was done by Ioannidis and Offerman in 2020. This had three phases. First, people were “given” a bottle of wine (on their screen) and were asked if they would sell the wine for some value between €0 and €9.9, chosen randomly by rolling two dice. Second, people participated in an auction for the wine with other participants. Finally, they were asked what value they would accept in lieu of the wine.

Their goal was to test if the auction phase would reduce the anchoring effect. Their results were that—umm—there was no anchoring effect, and so the market didn’t have anything to reduce.

Here’s how I score the replications:

What to make of this? One theory is that it depends on if people were choosing values to “buy” or “sell” the good at. (Or rather, mechanistic proxies for buying and selling.)

Another theory, suggested by Ioannidis and Offerman, is that it depends on how the anchor is chosen. The original study and the successful replication both used social security numbers. The half-successful replication used phone numbers instead. Both of the failed replications used dice, clearly demonstrated in front of the participants to be random.

Why would this matter? Well, Chapman and Johnson (1999; Experiment 3) did a related experiment with anchors and social security numbers. They found a correlation, but then also took the brave step of asking the participants what they thought was going on. Around 34% of participants said they thought the anchor values were informative and that the experimenters wanted them to be influenced by the anchor values! So perhaps anchoring on irrelevant information has an effect because participants for some reason think it is relevant.

I’m not sure if I believe this. For one thing, how could social security numbers be informative? It doesn’t make sense. For another, 34% of people doesn’t seem large enough to produce the reported effects. And third, when Chapman and Johnson broke things down, they found that among people who said anchor values were informative, the correlation was four times weaker than among those who didn’t say that!


So do people really anchor on irrelevant information?

When I first came across Fudenberg et al.’s failed replication, I assumed the effect wasn’t real. But if that were true, how can we explain one and a half successful replications? It’s very puzzling.

If the effect is real, it’s fragile. Either it depends on people mistakenly thinking the irrelevant information is relevant, or it depends on people wanting to please the experimenter, or it depends on if people are “buying” or “selling” something, or it depends on some other unclear details. Very small changes to how the experiment is done (like using dice instead of social security numbers) seem to make the effect vanish.

As things currently stand, I don’t think there’s any real-world scenario in which we can be confident that people will anchor on irrelevant information (other than, perhaps, exactly repeating Ariely et al.’s experiment). Hopefully future work can clarify this.

More confidence games

You drew 40 random cells from a sample and found that a new drug affected 16 of them. An online calculator told you:

“With 90% confidence, the true fraction is between 26.9% and 54.2%.”

But what does this really mean? We’ve talked about confidence sets before—read that post first if you find this one too difficult. Now let’s talk about intervals.

Confidence intervals

Years from now, all Generation β does is sit around meditating on probability theory and reading Ars Conjectandi. You work at a carnival where one day your boss says, “Our new ride is now so popular that we only have capacity for 10% of guests. I’ve got some bent coins of different colors, and I want you to create a guessing game to ration access to the ride.”

Here’s how the game is supposed to work:

  1. The guest will pick one coin, flip it, and announce the outcome.
  2. Based on that outcome, you guess some set of colors.
  3. The true color of the coin is revealed. If it’s not in the set of colors, the guest can go on the ride.

It’s essential that no matter what the guests do, only 10% of them can win the game. But otherwise, you’d like to guess as few colors as possible to better impress the players.

You’re given five bent coins, which your boss has CT-scanned and run exhaustive simulations on to find the true probability that each will come up heads.

Coin Prob. tails Prob. heads
red .9 .1
green .7 .3
blue .5 .5
yellow .3 .7
white .1 .9

Your first idea for a game is the obvious one: The guest flips their coin once, and you guess the color based on whether it comes up heads or tails. While you could do that, it would be extremely boring.

Then, you have another idea. Why not flip the coin twice? Take the red coin. It’s easy to compute the probability of getting different numbers of heads:

  • 0 heads: (0.9)² = 0.81 (the probability of flipping tails twice in a row)
  • 1 head: 0.1 × 0.9 + 0.9 × 0.1 = 0.18 (the probability of heads-then-tails plus the probability of tails-then-heads)
  • 2 heads: (0.1)² = 0.01 (the probability of flipping heads twice in a row)

Continuing this way, you make a table of the probability of getting a total number of heads after two coin flips for each of the coins:

0 heads 1 head 2 heads
red .81 .18 .01
green .49 .42 .09
blue .25 .50 .25
yellow .09 .42 .49
white .01 .18 .81

You could make a game based on two coinflips, but why not spice things up even more by flipping the coin, say, 5 times? You do a little bit of research and discover that the probability of getting a total of tot-heads heads after num-flips flips of a coin with bias prob is called a Binomial probability, written Binomial(tot-heads | num-flips, prob). For example, the probabilities we calculated above are Binomial(0 | 2, 0.1) = 0.81, Binomial(1 | 2, 0.1) = 0.18, and Binomial(2 | 2, 0.1) = 0.01. If num-flips is larger than 2 the math gets more complicated, but who cares? You find some code that can compute Binomial probabilities, and you use it to create the following table of the probability of getting each total number of heads after 5 coinflips:

0 1 2 3 4 5
red .59049 .32805 .07290 .00810 .00045 .00001
green .16807 .36015 .30870 .13230 .02835 .00243
blue .03125 .15625 .31250 .31250 .15625 .03125
yellow .00243 .02835 .13230 .30870 .36015 .16807
white .00001 .00045 .00810 .07290 .32805 .59049

This seems to make sense: The most likely outcome for the red coin is all tails, since the red coin is rarely heads. The blue coin’s distribution is symmetric around the middle, while the most likely outcome for the white coin is all heads.
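If you’re curious, the table above can be reproduced with a few lines of Python using only the standard library (the function name `binomial` and the `biases` dict are just for this sketch):

```python
from math import comb

def binomial(tot_heads, num_flips, prob):
    """Probability of exactly tot_heads heads in num_flips flips of a coin
    that lands heads with probability prob."""
    return comb(num_flips, tot_heads) * prob**tot_heads * (1 - prob)**(num_flips - tot_heads)

# The five coins from the text and their probabilities of heads.
biases = {"red": 0.1, "green": 0.3, "blue": 0.5, "yellow": 0.7, "white": 0.9}

for color, p in biases.items():
    row = [round(binomial(k, 5, p), 5) for k in range(6)]
    print(color, row)
```

Each printed row matches the corresponding row of the table.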

Now, what colors should you guess for each outcome? Again, you need to make sure that, no matter what color the guest chooses, you will include that color with 90% probability. This is equivalent to covering .9 of the probability from each row. You decide to go about this in a greedy way: For each row, you add entries from largest to smallest until you get a total that’s above 0.9. Doing that gives this result (included entries are marked in brackets):

        0         1         2         3         4         5
red    [.59049]  [.32805]   .07290    .00810    .00045    .00001
green  [.16807]  [.36015]  [.30870]  [.13230]   .02835    .00243
blue    .03125   [.15625]  [.31250]  [.31250]  [.15625]   .03125
yellow  .00243    .02835   [.13230]  [.30870]  [.36015]  [.16807]
white   .00001    .00045    .00810    .07290   [.32805]  [.59049]

This corresponds to the following confidence sets:

Outcome What you guess
0 {red, green}
1 {red, green, blue}
2 {green, blue, yellow}
3 {green, blue, yellow}
4 {blue, yellow, white}
5 {yellow, white}
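The greedy construction above can be sketched in Python (standard library only; variable names like `included` and `conf_sets` are just for this sketch):

```python
from math import comb

def binomial(k, n, p):
    """Probability of exactly k heads in n flips of a coin with heads-probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

biases = {"red": 0.1, "green": 0.3, "blue": 0.5, "yellow": 0.7, "white": 0.9}
num_flips, level = 5, 0.9

# For each coin, greedily include outcomes from most to least probable
# until the included probability reaches the confidence level.
included = {}  # color -> set of outcomes covered for that coin
for color, p in biases.items():
    probs = {k: binomial(k, num_flips, p) for k in range(num_flips + 1)}
    chosen, total = set(), 0.0
    for k in sorted(probs, key=probs.get, reverse=True):
        if total >= level:
            break
        chosen.add(k)
        total += probs[k]
    included[color] = chosen

# Invert: for each outcome, the confidence set is every coin whose
# included outcomes contain that outcome.
conf_sets = {k: {c for c, ks in included.items() if k in ks} for k in range(num_flips + 1)}
for k, colors in conf_sets.items():
    print(k, sorted(colors))
```

Running this reproduces the table of confidence sets: outcome 0 maps to {red, green}, outcome 4 to {blue, yellow, white}, and so on.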

Remember what we stressed last time: When we get 4 heads and say we are “90% confident” the color is blue, yellow, or white, we don’t mean “90% probability”, we just mean that your guessing procedure will work for 90% of guests. After seeing 4 heads, the probability of any given color is—according to the worldview of confidence intervals—meaningless, because the color is a fixed quantity. And even if you’re willing to talk about probabilities in such situations, the probability could be much higher or lower than 90%.

It occurs to you that you can visualize this as a heatmap, with lighter colors representing higher probabilities. The entries in the following figure are laid out in the same way as the above table. Remember that red has a .1 probability of being heads, so it is in the first row; green has a .3 probability, so it’s in the second row; etc.

You can visualize the confidence sets by drawing an outline around the coin/outcome pairs that are included in your strategy.

Things go well for a while, but then your boss comes around again and says “People want more of a challenge!” You’re given 19 coins with each of the probabilities .05, .10, .15, …, .95 and told to increase the number of coin flips from 5 to 40.

At this point, it would be tedious to look at tables of numbers, but you can still visualize things:

You can use the same greedy strategy of including elements from each row until you get a sum of 0.9. If you do that, this is what you end up covering:

Notice: For any given outcome, the set of coins that you include are always next to each other. This happens just because for each coin, there’s a single mode of probability around a given outcome, and the location of this mode changes smoothly as the bias of the coin changes. This is why we can talk about confidence intervals rather than confidence sets: The math happens to work out in such a way that the included coins are always next to each other.

Finally, your boss suggests one last change. The game should work this way:

  1. Each guest is given a soft-metal coin, which they can bend into whatever shape they want.
  2. That coin is flipped 40 times, and the outcome is announced.
  3. You need to guess some interval that hopefully contains the true bias of the coin.
  4. The coin is CT-scanned, and the carnival’s compute cluster finds the true bias. If it’s not in the interval you guessed, the guest can go on the ride.

Thinking about how to address this game, it occurs to you that you can make figures in the same way for any number of coins, and if you use a fine enough grid, you will cover all possibilities. The following figure shows what you get if you use the same process with 1001 coins with biases 0.000, 0.001, 0.002, …, 1.000.

Now, remember where we started: We tested 40 cells and found that 16 of them had changed, and an online calculator told us that, with 90% confidence, the true fraction was between 26.9% and 54.2%.

To understand where these numbers come from, just take this figure and put a vertical line at # heads = 16:

What’s included is all the coins with biases between .269 and .542. That’s why the confidence interval for 16 out of 40 is 26.9% to 54.2%.
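You can check those endpoints numerically. The sketch below rebuilds the greedy 90% sets on a grid of 1001 biases and reads off which biases include 16 heads. (Real calculators typically use a slightly different construction, such as Clopper–Pearson, so the last digit may not match exactly.)

```python
from math import comb

def binomial(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def greedy_outcomes(n, p, level=0.9):
    """Outcomes included for a coin of bias p: most probable first,
    until the total included probability reaches `level`."""
    probs = {k: binomial(k, n, p) for k in range(n + 1)}
    chosen, total = set(), 0.0
    for k in sorted(probs, key=probs.get, reverse=True):
        if total >= level:
            break
        chosen.add(k)
        total += probs[k]
    return chosen

# The confidence interval for 16 heads out of 40 flips: every bias on the
# grid whose included outcomes contain 16.
n, heads = 40, 16
grid = [i / 1000 for i in range(1001)]
covered = [p for p in grid if heads in greedy_outcomes(n, p)]
print(min(covered), max(covered))
```

The smallest and largest covered biases land close to .269 and .542, which is the vertical-line construction from the figure.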

Confidence games

Say you’re developing a new anti-cancer drug. You apply it to some cell line, draw 40 random cells, and manually inspect them. You find that the drug changed 16 of the 40, suggesting the drug is around 40% effective. But of course, this is just an estimate. So you plug the numbers into an online calculator which tells you something like this:

With 90% confidence, the true fraction is between 26.9% and 54.2%.

OK, but what does this actually mean? Here "confidence" is a technical term, with a precise (and somewhat subtle) meaning. Almost every scientist interacts with confidence intervals, but surveys show the vast majority don’t fully understand them (Hoekstra et al., 2014; Lyu et al., 2020).

I have a theory: Confidence intervals can be explained—really explained—in a fully conceptual way, without using any math beyond arithmetic. The core idea is to first look at confidence sets in cases where everything is discrete. In these cases, everything can be laid out in a table, and the core difficulty is just making sure you don’t confuse the rows and the columns.

The year is 2052. The youth have grown tired of their phones and now only care about probability and game theory. You work at a carnival. One day your boss comes over and says, "Listen, we don’t have enough capacity on our new ride. To pander to the kids and their damn probabilities, we’re going to make a game, where winners can go on the ride. You’re going to run this game."

Your boss outlines some (slightly strange) rules for how the game is supposed to work:

  1. The guest picks one of a few 4-sided dice, each of which has a different color and weight distribution.
  2. The guest will roll that die, and the outcome (⚀, ⚁, ⚂, or ⚃) is announced.
  3. Based on that outcome, you need to guess some set of colors.
  4. The true color of the die is revealed. If it’s not in the set of colors, the guest can go on the ride.

For example, say that after the die is rolled you guess {red, blue}, and the true color turns out to be red. Then the guest wouldn’t get to go on the ride. If you’d guessed {green} and the true color were blue, the guest would get to go on the ride.

Your boss stresses two things: First, the ride only has capacity for 30% of the guests. Second, you should keep your sets as small as possible (to keep things interesting).

Here are the five dice, which the carnival’s lab has helpfully CT-scanned and run simulations on to calculate the true probability that each die will come up with each side:

        ⚀    ⚁    ⚂    ⚃
red     .7   .1   .1   .1
green   .1   .7   .1   .1
blue    .1   .1   .7   .1
yellow  .4   .3   .2   .1
white   .1   .2   .4   .3

What should your strategy be? You reason as follows:

  1. For each outcome (⚀, ⚁, ⚂, or ⚃) you need to choose some set of colors. So choosing a strategy is equivalent to choosing some subset of entries in the above table.
  2. The guests will share information and use their cursed game theory to maximize their chances. If any color gives them a better chance to get on the ride, they’ll all pick that color. So, it’s necessary that for each row, the sum of probabilities in the columns you include must add up to at least .7.

The obvious choice is the following strategy, where the included entries are marked in brackets:

        ⚀     ⚁     ⚂     ⚃
red    [.7]   .1    .1    .1
green   .1   [.7]   .1    .1
blue    .1    .1   [.7]   .1
yellow [.4]  [.3]   .2    .1
white   .1    .2   [.4]  [.3]

Or, you can picture your strategy as a list of outcomes, and what you guess for each. These guesses are called confidence sets.

Outcome What you guess
⚀ {red, yellow}
⚁ {green, yellow}
⚂ {blue, white}
⚃ {white}

What can we say about this? You have the guarantee your boss asked for: No matter what color die the guest chooses, the set you guess will include the true color 70% of the time.

THAT’S ALL WE CAN SAY. THAT AND NOTHING MORE. Say the guest chooses a die and rolls ⚁. You might be tempted to say "With 70% probability, the true color is either green or yellow." Wrong. I know from experience that many people don’t want to accept this and that after being told this is wrong they look for a way out, a way to escape this harsh reality. Stop looking. You cannot talk about the probability of the die being a given color, because the guest already chose it. It’s a fixed quantity; you just don’t happen to know what it is.

In the worldview of confidence sets, it’s nonsensical to talk about probabilities of fixed unknown things. That would be like talking about the "probability" that George Washington was born in 1741, or the "probability" that France is larger than Germany. These things don’t have probabilities because they aren’t repeatable random events. To be sure, many people are actually fine with using probabilities like that (they’re called subjectivists), but by definition, confidence sets don’t use probability like that. If you want to roll that way, then you’re a Bayesian (congratulations!), but you’ll still want to understand confidence when other people talk about it.

Still don’t believe me? Say you’re OK with subjective probabilities, and say that guests choose dice randomly so that the prior probabilities of the colors are equal. Now suppose the guest rolls ⚃. Would you be tempted to say that there is a 70% chance the true color is white? In this situation, the posterior probability of each color is proportional to the chance that color rolls ⚃. Look at the right column of the big probability table. There are plenty of ways to get ⚃ via other colors. Dividing white’s entry by the sum of that column, the posterior probability of white is 3/7 = .3 / (.1 × 4 + .3). It’s not even half! In the same way, we can calculate the posterior probability that the confidence set contains the true color for each outcome by summing the included entries in each column and dividing by the sum of all the entries in that column:

Outcome Prob guess includes true color
⚀ 11/14 ≈ 0.786
⚁ 10/14 ≈ 0.714
⚂ 11/15 ≈ 0.733
⚃ 3/7 ≈ 0.429

Sometimes it’s higher and sometimes it’s lower, but it’s never 70%. And remember, this is assuming that guests choose dice completely at random, which they don’t.
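These per-outcome probabilities can be checked with exact fractions, computing directly from the probability table under the uniform-prior assumption (the `dice` and `guesses` names are just for this sketch):

```python
from fractions import Fraction as F

# Probability each die shows each face; outcomes are indexed 0..3
# for the faces one through four.
dice = {
    "red":    [F(7, 10), F(1, 10), F(1, 10), F(1, 10)],
    "green":  [F(1, 10), F(7, 10), F(1, 10), F(1, 10)],
    "blue":   [F(1, 10), F(1, 10), F(7, 10), F(1, 10)],
    "yellow": [F(4, 10), F(3, 10), F(2, 10), F(1, 10)],
    "white":  [F(1, 10), F(2, 10), F(4, 10), F(3, 10)],
}

# The confidence sets from the strategy in the text.
guesses = {0: {"red", "yellow"}, 1: {"green", "yellow"},
           2: {"blue", "white"}, 3: {"white"}}

# With a uniform prior over colors, P(color | outcome) is that color's
# column entry divided by the column sum; the probability the guess
# contains the true color sums this over the guessed colors.
probs_given_outcome = {}
for outcome, guess in guesses.items():
    col_sum = sum(p[outcome] for p in dice.values())
    probs_given_outcome[outcome] = sum(dice[c][outcome] for c in guess) / col_sum

print(probs_given_outcome)
```

None of the four values equals 7/10, which is the whole point.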

So, while it’s tempting to talk about probabilities, we can’t. What do we do instead?

If we wanted to be as confusing as possible, we’d pick a word that in English sounds like it means probability, and then we’d use it in a sentence as if it were a probability, but the whole time we’d be referring to a completely different concept that absolutely is not a probability. Well, umm, that’s what we do. Given a roll of ⚁, we say:

“With 70% confidence, the true color is either green or yellow.”

This sentence is designed to mislead you. It does not mean anything similar to what it would mean as a normal English sentence. It is just a shorthand for this:

“We have a procedure that maps dice rolls to sets of colors. We’ve designed the procedure so that, if we roll any of the dice millions of times and compute the corresponding sets of colors, at least 70% of those sets will contain the true color. For the dice roll we observed in this particular instance, our procedure maps the outcome to {green, yellow}.”

Say that the carnival opens for the day, and we carefully record everything that happens.

Guest True color Outcome Guess
Amy red ⚀ {red, yellow}
Bob yellow ⚁ {green, yellow}
Carlos green ⚁ {green, yellow}
⋮
Zander white ⚃ {white}

All we can guarantee is that, in the long run, 70% of the guesses will contain the true colors. We can’t say anything about the particular probability in any particular row because:

  1. Working with confidence sets means acceptance of a worldview in which it’s meaningless to talk about subjective probabilities of things that have already happened, just because you happen not to know what those things are.
  2. Even if you were willing to talk about subjective probabilities, you can’t do it because you’d need a prior distribution over the different colors.
  3. Even if you have a prior distribution over the colors and calculate the probabilities, they might be much larger or smaller than 70%.

When talking about confidence, we give no guarantee—none!—about what the actual true color is in any particular instance. All that we guarantee is that in the multiverse of different worlds that branched out when the die was rolled, the true color is in the set in 70% of them. But you’re in one specific universe that may or may not be in that 70%.
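If the long-run guarantee still feels slippery, a quick simulation shows it directly (a sketch; the seed and trial count are arbitrary). Whichever single color every guest picks, the guess contains the true color about 70% of the time in the long run:

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility

dice = {
    "red":    [0.7, 0.1, 0.1, 0.1],
    "green":  [0.1, 0.7, 0.1, 0.1],
    "blue":   [0.1, 0.1, 0.7, 0.1],
    "yellow": [0.4, 0.3, 0.2, 0.1],
    "white":  [0.1, 0.2, 0.4, 0.3],
}
guesses = {0: {"red", "yellow"}, 1: {"green", "yellow"},
           2: {"blue", "white"}, 3: {"white"}}

# Simulate many guests all choosing the same color (the worst any guest
# can do): the long-run fraction of guesses containing the true color is
# 70% for every color, even though the per-outcome probability varies.
trials = 100_000
coverage = {}
for color, probs in dice.items():
    outcomes = random.choices(range(4), weights=probs, k=trials)
    hits = sum(color in guesses[o] for o in outcomes)
    coverage[color] = hits / trials

print(coverage)  # every value close to 0.7
```

Note that the guarantee holds per color, which is exactly what makes it game-theoretically safe against coordinating guests.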

This leads us to what I think is the core confusion about confidence sets:

The guarantees run over the rows of the big probability table, not over the columns.

We only make guarantees about what happens if you run the experiment many times, i.e. over the rows. We guarantee nothing about the true color for any particular outcome, i.e. over the columns. That’s quite annoying. I think part of what makes it confusing is that it’s so different from what people actually want. You probably want to know what the color is in this world—why are we talking about other branches of the multiverse? Well, because that’s all we can do.

Still tempted to think about confidence as probabilities? Here’s one last illustration. We could have used a different strategy, where for white we include ⚀ and ⚁ instead of ⚃ (included entries are marked in brackets):

        ⚀     ⚁     ⚂     ⚃
red    [.7]   .1    .1    .1
green   .1   [.7]   .1    .1
blue    .1    .1   [.7]   .1
yellow [.4]  [.3]   .2    .1
white  [.1]  [.2]  [.4]   .3

This is a "worse" strategy in the sense that we tend to have larger sets that won’t impress the guests. But we still include .7 probability from each row, so it’s still valid.

Here are the corresponding confidence sets:

Outcome What you guess
⚀ {red, yellow, white}
⚁ {green, yellow, white}
⚂ {blue, white}
⚃ {} (the empty set)

Now suppose we play the game, and the outcome is ⚃. Then we can say:

“With 70% confidence, the true color is nothing.”

This is perfectly valid! Obviously, the true color is never nothing. But we are allowed to say this because we are using a procedure in which the above statement doesn’t occur very often. When you talk about "70% confidence" you promise that most of the statements you make are true, but you can be completely, arbitrarily wrong 30% of the time.

Note: There’s a follow-up post with more confidence games.

The replication game

The replication crisis in psychology started with a bang with Daryl Bem’s 2011 paper Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect. In this paper, Bem performed experiments that appeared to prove the existence of extrasensory perception (ESP). For example, subjects could guess (better than chance) where erotic images would appear on a screen. There were thousands of subjects. The paper used simple, standard statistics methods, and found significance for eight of nine experiments. Bem is a respected researcher, who no one suspects of anything underhanded.

The problem, of course, is that ESP isn’t real.

The psychology community was shaken. A smart person had carefully followed standard research practices and proved something impossible. This was ominous. Bem’s paper was only questioned because the result was impossible. How many other false results were lurking undetected in the literature?

Today, we’d say he had wandered the garden of forking paths: If you look at enough datasets from enough angles, you’ll always find significance eventually. It’s easy to do this even if you have the best intentions.

Many psychologists saw the Bem affair as a crisis and quickly moved to assess the damage. In 2014, an entire issue of Social Psychology was dedicated to registered replications. Each paper was both a replication (a repeat of a previous experiment) and pre-registered (the research plan and statistical analysis were peer-reviewed before the experiment was done).

So how well did existing results replicate? In their introduction to the issue, Nosek and Lakens are rather coy:

This special issue contains several replications of textbook studies, sometimes with surprising results.

Awesome, but which ones were surprising, how surprising were they, and in what way?

I wanted a number, a summary, something I could use as a heuristic. So, I read all the papers and rated every experiment in every paper as either a full success (★), a success with a much smaller effect size (☆), or a failure (✗). I used the summary from the authors if they gave one, which they often didn’t, probably because people are squeamish about yelling my famous colleague’s Science paper is bogus too loudly. Then, for each effect, I averaged the scores for all the experiments using ✗=0, ☆=2, and ★=4.

So, let’s play a game.

I’ll give a short description of the effects that were tested. You can predict how well each replicated by choosing an integer between 0 (failure) and 4 (full success). You can score yourself manually, or there’s a form you can use below.

(Or, if you’re the impatient type, feel free to jump to the summary.)

The effects

The first few effects are textbook results. (📖)

(1) The primacy of warmth is the idea that when forming impressions of others, the "warmth" of someone’s personality is more important than competence. This claim goes back to Solomon Asch in 1946: people were given lists of attributes describing a person, like "intelligent, skillful, industrious, warm, determined, practical, cautious", and asked which was most important.

(2) Deviation-rejection is the idea that someone who stubbornly holds an opinion contrary to a group consensus will ultimately be rejected from that group. Schachter (1951) ran focus groups to discuss what to do with a juvenile delinquent. Most people were lenient, but a confederate in the group always supported a harsh punishment. This confederate was talked to less over time, and chosen less for future groups.

(3) The Romeo and Juliet effect is the idea that parental opposition to a relationship will increase the love and commitment in that relationship. This was first suggested by Driscoll et al. (1972), who found that parental interference was correlated with love and trust in relationships and that, over time, increases in interference were correlated with an increase in love.

(4) Single exposure musical conditioning was investigated by Gorn (1982) by randomly showing people either a blue or beige pen, while playing either music they liked or disliked. Later, people could choose a pen of either color to keep. This experiment found that 79% of people who heard music they liked chose the pen shown on screen, while only 30% of people who heard music they disliked did the same.

The next few effects concerned papers where the literature is somewhat in conflict or where there are moderators with unclear impact. (⚔️)

(5) Stereotype priming was investigated by Blair and Banaji (1996) who found that briefly flashing a stereotypically "male" or "female" word in front of someone changed how long it took to discriminate between male or female first names. Additionally, Banaji and Hardin (1996) found an effect even if the discrimination was unrelated to the primes: "Male" words made participants faster to recognize male names even if the task was to discriminate city names vs. first names.

(6) Stereotype susceptibility (or stereotype threat) is the idea that cognitive performance is influenced by stereotypes. Shih et al. (1999) took Asian-American women and had them take a math test, while being primed to either think about being women or about being Asian. The Asian-primed group got 54% of the questions right while the female-primed group got only 43%. However, these results came from a small sample, and results in real-world applications have been mixed.

(7) Sex differences in distress from infidelity. A common idea in evolutionary psychology is that males are more upset by sexual infidelity than females. Shackelford et al. (2004) looked at a population with a mean age of 20 years and asked if people would be more distressed if their partner had passionate sex with someone else or formed a deep emotional attachment. 76% of young men chose sex, as opposed to 32% of women. In an older population with a mean age of 67 years, 68% of men chose sex, as opposed to 49% of women.

(8) Psychological distance and moral judgment is the idea that people make more intense moral judgments (positive or negative) about more psychologically distant acts. Eyal et al. (2008) had participants make judgments about people who committed various acts (incest, eating a dead pet, donating to charity), varying whether the act happened in the near or distant future. People judged those in the distant future around 1.5 points more harshly on a -5 to +5 scale. However, Gong and Medin (2012) performed a similar experiment and came to the opposite conclusion.

Finally, there were recent studies that had attracted a lot of attention. (🔥)

(9) Does cleanliness influence moral judgments? Schnall et al. (2008) found that people make less severe moral judgments when they feel clean. They primed people with words and then asked people to judge a set of moral dilemmas, e.g. trolley problems or faking information on a resume. They found that people primed with "clean" words judged people around 1 point less harshly than those primed with neutral words, on a 7-point scale. In a second experiment, people saw a clip from Trainspotting with a man using an unclean toilet and were asked to wash their hands (or not) before being given the same moral dilemmas. Those who washed their hands judged people around 0.7 points less harshly.

(10) Does physical warmth promote interpersonal warmth? Williams and Bargh (2008) gave participants either a cold pack or a heat pack to evaluate, after which they could choose a gift either for themselves or for a friend. Those given the heat pack were around 3.5x more likely to choose a gift for a friend.

(11) Moral licensing is the idea that someone who does something virtuous will later feel justified to do something morally questionable. Sachdeva et al. (2009) had participants write a short story about themselves using words that were either positive words ("caring", "generous") or negative words ("disloyal", "greedy"). Afterward, they asked how much they would be willing to donate to charity. Those who used the positive words donated a mean of $1.07, while those who used negative words donated $5.30.

(12) Can superstition improve performance? Damisch et al. (2010) gave subjects a golf ball and asked them to attempt a 100 cm putt. Subjects told the ball had been lucky so far made 65% of putts, as opposed to 48% for controls.

(13) Is moral behavior connected to brightness? Banerjee et al. (2012) asked participants to describe something they did in the past that was either ethical or unethical, then asked them how bright their room was. Those in the ethical group described their room as around 0.6 points brighter on a 7-point scale. They also judged various light-emitting products (lamps, candles, flashlights) as around 2 points more desirable on a 7-point scale.


To play the game, fill out this Google form quiz thing.

Or, you can score yourself. In the following table, record your absolute error for each effect. (If you guessed 1 and the correct answer was 4, you get 3 points.) Then add all these errors up.

The best possible score is 0, while the worst is 41.
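For reference, the scoring is just summed absolute error, which can be sketched in a few lines. The answers list encodes the results table below as integers; the all-4s player, who confidently predicts full replication for everything, achieves the worst score of 41:

```python
# Replication results from the table below, encoded 0 (failure) .. 4 (full success).
answers = [0, 2, 0, 4, 1, 1, 1, 2, 0, 0, 0, 0, 0]

def score(guesses, answers):
    """Sum of absolute errors: 0 is a perfect score, lower is better."""
    return sum(abs(g - a) for g, a in zip(guesses, answers))

# A player who guesses every effect half-replicates:
print(score([2] * 13, answers))  # → 19

# A player who guesses every effect fully replicates:
print(score([4] * 13, answers))  # → 41
```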

Here are the results for each effect, ranging from ○○○○ (failure) to ●●●● (full replication).

Effect Type Result
Primacy of warmth 📖 ○○○○
Deviation-rejection 📖 ●●○○
Romeo and Juliet effect 📖 ○○○○
Single exposure musical conditioning 📖 ●●●●
Stereotype priming ⚔️ ●○○○
Stereotype susceptibility ⚔️ ●○○○
Sex differences in distress from infidelity ⚔️ ●○○○
Psychological distance and moral judgment ⚔️ ●●○○
Cleanliness and moral judgments 🔥 ○○○○
Physical warmth and interpersonal warmth 🔥 ○○○○
Moral Licensing 🔥 ○○○○
Superstition and performance 🔥 ○○○○
Moral behavior and brightness 🔥 ○○○○

What to make of this?

Obviously, the replication rate is strikingly low. Only one effect replicated at the original effect size, and even this has some caveats (below). The average replication score was 0.8 / 4. And one effect (Romeo and Juliet) deserves a negative score: It didn’t just fail to replicate, there was a large effect in the opposite direction.

Also striking is that all of the new hot effects (🔥) failed—not a single experiment in a single paper replicated, even at a smaller effect size. Strangely, all the conflicted papers (⚔️) replicated at least a little bit.

I don’t think things are nearly as bad as this table suggests. For one thing, if you were a psychologist, what would make you perform a registered replication? Personally, I’d be motivated by something like this:

someone is wrong on the internet

I’d do it because I thought some famous effect wasn’t real. So this is a highly non-random sample. The correct conclusion is less psychology is all wrong and more when psychologists think an effect is bogus, they’re usually right.

Second, these results are mostly social psychology. You, dear reader, are probably not a social psychologist, but rather someone interested in the broad art of predicting stuff, which is the domain of cognitive psychology. (At least, unless you’re predicting how people interact in groups.) The Open Science Collaboration in 2015 tried to reproduce a bunch of recent papers in social and cognitive psychology. They found that around 28% of social effects subjectively "replicated", as opposed to around 53% of cognitive effects. They also found that replication effect sizes were much closer to the original effect sizes for cognitive psychology (a ratio of around 0.56 as compared to 0.31 in social psychology).

The replications

Here are more details about what happened in each of the replications, and justifications of the ratings I gave. (Click to expand.) If you’re the author of one of these replications and I got anything wrong, I’d love to hear from you.

The primacy of warmth

This claim goes back to Solomon Asch in 1946. The idea is that, when forming impressions of people, warmth-related judgments are more important than competence. Nauts et al. replicated Asch’s experiments. They showed different people various lists of traits, such as the following:

  • (Condition 1) Intelligent, skillful, industrious, warm, determined, practical, cautious
  • (Condition 2) Intelligent, skillful, industrious, cold, determined, practical, cautious
  • (Condition 3) Obedient, weak, shallow, warm, unambitious, vain
  • (Condition 4) Vain, shrewd, unscrupulous, warm, shallow, envious
  • (Condition 5) Intelligent, skillful, sincere, cold, conscientious, helpful, modest

The fraction of people that chose "warm" or "cold" as the most important trait was as follows:

Condition Most chosen trait % choosing it % choosing warm/cold
1 intelligent 55.3% 19.5%
2 intelligent 36.2% 30.0%
3 obedient 21.7% 7.0%
4 vain 44.0% 6.6%
5 intelligent 53.5% 7.8%

According to Asch’s theory, people should choose "warm" and "cold" as the most important traits in conditions 1 and 2, but not in the others. It is true that warm/cold were considered more important in these conditions, but they were never the most common choice as the most important trait. Nauts et al. call this a clear failure of this particular experiment to replicate, though they emphasize that other research makes clear that warmth is indeed primary in many circumstances.

Summary: ✗ Total replication failure.


Deviation-rejection

Wesselmann et al. investigated "deviation-rejection", the claim that someone who holds an opinion contrary to a group consensus will ultimately be rejected from that group. Following Schachter (1951), they created groups of 10 people, consisting of seven experimental subjects and three confederates. Everyone was given a case study of a juvenile delinquent named Johnny Rocco and was then asked how Johnny should be treated, followed by a discussion. Most subjects were lenient. The mode confederate always followed the current group consensus. The slider confederate first supported harsh treatment, then gradually shifted towards leniency. The deviate confederate always said to punish Johnny and never changed.

They looked at three claims made by Schachter. First, did people communicate with the deviate confederate less over time? They did seem to, though the data was noisy.

Second, they looked at whether people would assign the confederates to leadership roles in future committees. Contra Schachter, they found no effect.

Third, they had people rank each person for how much they’d like to have them in future groups. On a scale of 1-3, the slider got a score of 1.74, the mode a score of 1.91, and the deviate a score of 2.34. (So people dislike nonconformists, but like people who are willing to change their mind?) This replicates Schachter again, albeit with a significantly smaller effect size.

Summary: ☆ Replication mostly succeeded, but with a smaller effect size

The Romeo and Juliet effect

The Romeo and Juliet effect is the claim that parental opposition to a relationship will increase the love and commitment in that relationship. This was first suggested by Driscoll et al. (1972), who found that parental interference was correlated with love and trust, and also that, over time, increases in parental interference were correlated with increases in love.

Sinclair et al. replicated these experiments. They found people online who were in relationships and asked them about their relationship, e.g. how much they loved and trusted their partner. They also asked how much friends or parents disapproved of the relationship. They contacted these people again around 4 months later and then looked at correlations.

Their results were the opposite of what Driscoll et al. found. Greater approval from friends and parents was associated with higher relationship quality, not lower. And increased parental disapproval was correlated with decreased love.

Summary: ✗ Calling this just a failure is generous. The original effect not only failed to replicate, but the opposite effect held (with a large effect size and high statistical significance).

Single-exposure musical conditioning

In Gorn (1982), people were randomly shown either a blue or beige pen, while music they liked or disliked played. Then, later, they could choose one pen of either color to keep. This experiment found that 79% of people who heard music they liked chose the pen shown on screen, compared with only 30% of people who heard music they disliked.

Vermeulen et al. attempted to reproduce this result. In a first experiment, they used music similar to the original music from 1982: Summer Nights from Grease as the "liked" music, and Aaj Kaun Gali Gayo Shyam by Parveen Sultana as the "disliked" music. They confirmed that people really did like the Grease song more than the other song (mean ratings of 3.72 vs. 2.11 on a scale of 1-7), but found no real effect on pen choice.

After this, they repeated the same study, but with an actor pretending to be a researcher from another department and a post-experimental questionnaire. This again found no real effect.

Finally, maybe the problem was just using old music that college students are unfamiliar with? To test this, they got two different renditions of Rihanna’s We Found Love by cover artists, and selected one (that people liked) as the "liked" music and one (that people didn’t like) as the "disliked" music. People liked the good rendition much more than the other (mean score of 5.60 vs. 2.48).

For this third experiment, they ran out of students in the class, and so had fewer subjects than planned. Still, they found that 57% of people who heard the "liked" music chose the pen on screen, as opposed to only 23% of people who heard the "disliked" music. Despite the smaller sample, this was still highly significant, with p=.003.

Still, I think this mostly rescues the effect: Students didn’t really like the music from Grease in the first two experiments, they just disliked it slightly less (3.72 is a low score on a scale of 1-7!). The third experiment is the only one where there’s a big difference in how much people like the music, and there’s a big effect there. It’s unfortunate they ran out of subjects, though!

Summary: ★? The authors call this a "somewhat unreliable successful replication".

Stereotype priming

Blair and Banaji (1996) claimed that if you briefly flashed a stereotypically "male" or "female" word in front of someone, that would change how long it would take people to discriminate between male or female first names. Additionally, Banaji and Hardin (1996) found this could have an effect even if the discrimination was unrelated to the primes (e.g. discriminating cities vs. names).

Müller and Rothermund set out to replicate these effects. They had people come into the lab and fixate on a screen. Then, they’d briefly be shown either stereotypically "male" or "female" priming words. Example male primes are "computer", "to fight", and "aggressive", while example female primes are "ballet", "to put on make-up", and "gossipy". These primes were shown for only 200 ms.

In a first experiment, the prime was followed by either a male name ("Achim"; this happened in Germany) or a female name ("Annette"), which subjects needed to classify as quickly as possible. Here were the mean times (and standard deviations) in ms:

Target Gender Male Prime Female Prime
male 554 (80) 566 (80)
female 562 (83) 549 (80)

There was a significant effect—albeit a small one. However, Blair and Banaji also found a small effect, although around 2x or 3x larger than this one.

A second experiment was the same, except now subjects would see a (male or female) first name 50% of the time and a city name (e.g. "Aachen") the other 50% of the time. Now, subjects needed to distinguish first names from city names.

Target Gender Male Prime Female Prime
male 605 (91) 605 (90)
female 570 (86) 567 (85)

In this analysis, they simply ignore all trials where a city was shown, so this table is showing how long it takes to recognize male/female names as names. For whatever reason, people found it to be harder to recognize male names, but priming had no effect on this. In contrast, Banaji and Hardin had found that changing the prime would have an effect of around 14 ms.

Summary: ☆ / ✗ Half the replication failed, the other half succeeded with a smaller effect size.

Stereotype susceptibility

Shih et al. (1999) took Asian-American women and had them take a math test. Before the math test, some were primed to think about being women by being given questions about coed or single-sex living arrangements. Others were primed to think about being Asian by answering questions about family and ethnicity. They found that the Asian-primed group got 54% right while the female-primed group got 43% right. (The control group got 49%.)

Gibson et al. replicated this at six universities in the Southeastern US. With a sample of 156 subjects (as opposed to only 16), they found that the Asian-primed group got 59% right, while the female-primed group got 53% right. This effect was smaller than the original and nonsignificant (p=.08). They then excluded participants who weren’t aware of stereotypes regarding math and Asians/women. Among the remaining 127 subjects, the Asian-primed group got 63% right, while the female-primed group got 51% right, and the effect was significant (p=.02).

But then, in a second article, Moon and Roeder tried to replicate exactly the same result using the same experimental protocol. They found that the Asian-primed group got 46% correct, while the female-primed group got 43% correct. This difference was nonsignificant (p=.44). However, in this same experiment, the control group got 50% correct.

Among only those aware of the stereotype, the Asian-primed group got 47% correct, while the female-primed group got 43%. Both of these results were nonsignificant (p=.44, and p=.28, respectively). Here again, the control group got 50%. The higher performance in the control group is inconsistent with the theory of priming, so this is a conclusive failure.

Summary: ☆? / ✗ The first replication basically half-succeeded, while the second failed.

Sex differences in distress from infidelity

A common idea in evolutionary psychology is that males are more upset by sexual infidelity than females. Shackelford et al. (2004) looked at datasets from two populations, one with a mean age of 20 years and one with a mean age of 67. They found that in both populations males were more distressed by infidelity than females, though the difference was smaller in the older population.

Hans IJzerman et al. replicated these experiments. In the younger population, they successfully replicated the result. In the older population, they did not replicate the result.

Summary: ☆ / ✗ One successful replication with a smaller effect, and one failed replication.

Psychological distance and moral judgment

Eyal et al. (2008) claimed that people made more intense moral judgments for acts that were psychologically distant. Gong and Medin (2012) came to the opposite conclusion.

Žeželj and Jokić replicated this experiment. They had subjects make judgments about the actions of people in hypothetical scenarios. In a first experiment, they described incest or eating a dead pet, but varied whether it happened now or in the distant future. Contra Eyal et al., distant-future transgressions were judged similarly to near-future ones. (The near future was judged 0.12 points more harshly on a -5 to +5 scale, so the effect was actually in the wrong direction.)

In a second experiment, they instead varied whether subjects were asked to think in the first person about a specific person they knew performing the act, or to focus on their thoughts and consider it from a third-person perspective. All scenarios showed that people were harsher when thinking about things from a distance. The difference was around 0.44 averaged over scenarios. This was significant and similar in magnitude to what Eyal et al.’s research predicted.

A third experiment was similar to the first in that time was varied. The difference was that the scenarios concerned virtuous acts with complications, e.g. a company making a donation to the poor that improves its sales. They actually found the opposite of the effect Eyal et al. would have predicted: the distant-future acts were judged less virtuous. The difference was only 0.32 and not significant.

In a fourth experiment, participants were primed by initial questions to be in either a high-level or a low-level mindset. Here, they found that those primed into a low-level mindset were harsher than those in a high-level mindset. This was statistically significant and consistent with the predictions of Eyal et al., albeit around half the magnitude of the effect.

Summary: ✗ ✗ ☆ ★ Two clear failures, one success with a smaller effect, and one success with a similar effect. This should be an average score of 1.5/4, but to keep everything integer, I’ve scored it as 2/4 above.

Cleanliness and moral judgments

Schnall et al. (2008) claimed that people make less severe moral judgments when they feel clean.

Johnson et al. replicated these experiments. Participants first completed a puzzle that had either neutral words or cleanliness words, and then responded to a sequence of moral dilemmas. They found no effect at all from being primed with cleanliness words.

In a second experiment, participants watched a clip from Trainspotting with a man using an unclean toilet. They were then asked to wash their hands (or not) before responding to the same moral dilemmas. They found no effect at all from being assigned to wash their hands.

Summary: ✗ Clear failure

Physical warmth and interpersonal warmth

Williams and Bargh (2008) published an article in Science claiming that people who were physically warm would behave more pro-socially.

Lynott et al. replicated this. Participants were randomly given either a cold pack or a heat pack to evaluate, and then could choose a gift either for a friend or for themselves. Williams and Bargh found that those given heat were around 3.5x as likely to be pro-social. In the replication, they were actually slightly less likely.

Summary: ✗ Clear failure

Moral licensing

Moral licensing is the idea that someone who does something virtuous will later feel justified to do something morally questionable.

Blanken et al. reproduced a set of experiments by Sachdeva et al. (2009). In a first experiment, participants were induced to write a short story about themselves using words that were either positive, neutral, or negative. Afterward, they were asked how much they would be willing to donate to charity. Contra previous work, they found that people who used positive words were willing to donate slightly more (not significant).

A second experiment was similar except rather than being asked to donate to charity, participants imagined they ran a factory and were asked if they would run a costly filter to reduce pollution. Again, if anything the effect was the opposite of predicted, though it was non-significant.

In a third experiment, they used an online sample with many more subjects, and asked both of the previous questions. For running the filter, they found no effect. For donations, they found that there was no difference between neutral and positive priming, but people who were negatively primed did donate slightly more, and this was statistically significant (p=.044).

Arguably this is one successful replication, but let’s be careful: They basically ran four different experiments (all combinations of donations / running-filters and in-person / online subjects). For each of these they had three different comparisons (positive-vs-neutral / positive-vs-negative / neutral-vs-negative). That’s a lot of opportunities for false discovery, and the one effect that was found is just barely significant.

Summary: ✗ ✗ ✗? Two clear failures and one failure that you could maybe / possibly argue is a success.

Superstition and performance

Damisch et al. (2010) found that manipulating superstitious feelings could have dramatic effects on golfing performance. Subjects told that a ball was lucky made 65% of 100 cm putts, as opposed to 48% for controls.

Calin-Jageman and Caldwell reproduced this experiment. They found that the superstition-primed group was only 2% more accurate, which was not significant.

In a second experiment, they tried to make the "lucky" group feel even luckier by having a ball with a shamrock on it and saying "wow! you get to use the lucky ball". Again, there was no impact.

Summary: ✗ Clear failure

Moral behavior and brightness

Banerjee et al. (2012) found that recalling unethical behavior caused people to see the room as darker.

Brandt et al. replicated this. Participants were first asked to describe something they did in the past that was either ethical or unethical. In a first study, they were then asked how bright their room was. In a second study, they were instead asked how desirable lamps, candles, and flashlights were.

They found nothing. Recalling ethical vs. unethical behavior had no effect on the estimated brightness of the room, or how much people wanted light-emitting products.

Summary: ✗ Clear failure

The human regression ensemble

I sometimes worry that people credit machine learning with magical powers. Friends from other fields often show me little datasets. Maybe they measured the concentration of a protein in some cell line for the last few days and they want to know what it will be tomorrow.

Day Concentration
Monday 1.32
Tuesday 1.51
Wednesday 1.82
Thursday 2.27
Friday 2.51
Saturday ???

Sure, you can use a fancy algorithm for this, but I usually recommend just staring hard at the data, using your intuition, and making a guess. My friends respond with horror—you can’t just throw out predictions, that’s illegal! They want to use a rigorous method with guarantees.

Now, it’s true we have methods with guarantees, but those guarantees are often a bit of a mirage. For example, you can do a linear regression and get a confidence interval for the regression coefficients. That’s fine, but you’re assuming (1) the true relationship is linear, (2) the data are independent, (3) the noise is Gaussian, and (4) the magnitude of noise is constant. If those (unverifiable) assumptions aren’t true, your guarantees don’t hold.
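To see how assumption-laden those guarantees are, here's a sketch of the textbook confidence interval for the slope in simple linear regression (standard formulas, not anything specific to this post). Every line below leans on one of the four assumptions:

```python
import numpy as np
from scipy import stats

def ols_slope_ci(x, y, alpha=0.05):
    """Confidence interval for the slope b in y = a + b*x + noise.

    Valid ONLY if: the relationship really is linear, observations are
    independent, the noise is Gaussian, and its variance is constant.
    """
    n = len(x)
    b, a = np.polyfit(x, y, 1)                      # least-squares slope, intercept
    resid = y - (a + b * x)
    s2 = resid @ resid / (n - 2)                    # noise variance estimate
    se = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))  # standard error of the slope
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)        # Student-t critical value
    return b - t * se, b + t * se
```

If the data are, say, heteroscedastic or dependent, nothing breaks visibly; the function happily returns an interval whose claimed 95% coverage is simply false.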

All predictions need assumptions. The advantage of the "look at your data and make a guess" method is that you can’t fool yourself about this fact.

But is it really true that humans can do as well as algorithms for simple tasks? Let’s test this.

What I did

1. I defined four simple one-dimensional regression problems using common datasets. For each of those problems, I split the data into a training set and a test set. Here’s what that looks like for the boston dataset:

2. I took the training points, and plotted them to a .pdf file as black dots, with four red dots for registration.

In each .pdf file there were 25 identical copies of the training data like above.

3. I transferred that .pdf file to my tablet. On the tablet, I hand-drew 25 curves that I felt were all plausible fits of the data.

4. I transferred the labeled .pdf back to my computer, and wrote some simple image processing code that would read in all of the lines and average them. I then used this average to predict the test data.

5. As a comparison, I made predictions for the test data using six standard regression methods: Ridge regression (Ridge), local regression (LOWESS), Gaussian processes regression (GPR), random forests (RF), neural networks (MLP) and K-nearest neighbors (K-NN). More details about all these methods are below.

6. To measure error, I computed the root mean squared error (RMSE) and the mean absolute error (MAE).

To make sure the results were fair, I committed myself to just drawing the curves for each dataset once, and never touching them again, even if I did something that seems stupid in retrospect—which as you’ll see below, I did.
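The image-processing code from step 4 isn't shown, but the core of it, averaging many hand-drawn curves into one prediction, might look something like this sketch. It assumes curve extraction from the pdf has already produced (x, y) sample arrays; the two curves below are made-up stand-ins for the 25 real ones:

```python
import numpy as np

def average_curves(curves, grid):
    """Interpolate each hand-drawn curve onto a common x grid and
    average pointwise, giving one consensus prediction curve."""
    ys = [np.interp(grid, x, y) for x, y in curves]
    return np.mean(ys, axis=0)

grid = np.linspace(0, 1, 101)
curves = [
    (np.linspace(0, 1, 20), np.linspace(0, 1, 20) ** 2),        # one plausible fit
    (np.linspace(0, 1, 30), 0.1 + 0.8 * np.linspace(0, 1, 30)), # another
]
avg = average_curves(curves, grid)

# Predict a test point by reading it off the averaged curve:
y_pred = np.interp(0.5, grid, avg)
```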

On the other hand, I had to do some tinkering with all the machine learning methods to get reasonable results, e.g. changing how the neural networks were optimized, or what hyper-parameters to cross-validate over. This might create some bias, but if it does, it’s in favor of the machine learning methods and against me.
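Since neither the model setup nor the error code appears above, here's a hedged sketch of what steps 5 and 6 might look like with scikit-learn. The hyperparameters are illustrative guesses, not the actual settings used, and LOWESS is omitted since it isn't in base scikit-learn:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def compare(X_train, y_train, X_test, y_test):
    """Fit each method and report its test RMSE and MAE."""
    models = {
        "Ridge": Ridge(alpha=1.0),
        "GPR": GaussianProcessRegressor(),
        "RF": RandomForestRegressor(n_estimators=100, random_state=0),
        "MLP": MLPRegressor(hidden_layer_sizes=(50,), max_iter=5000, random_state=0),
        "K-NN": KNeighborsRegressor(n_neighbors=5),
    }
    results = {}
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_test)
        results[name] = (rmse(y_test, pred), mae(y_test, pred))
    return results
```

In practice you'd wrap this in cross-validation to pick the hyperparameters, which is exactly the kind of tinkering mentioned above.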


For the boston dataset, I used the crime variable for the x-axis and the house value variable for the y-axis. Here are all the lines I drew, on top of each other:

And here are the results comparing to the machine learning algorithms:

Here are the results for the diabetes dataset. I used age for the x-axis and disease progression for the y-axis. (I don’t think I did a great job drawing curves for this one.)

Here are the results for the iris dataset, using sepal length for the x-axis and petal width for the y-axis.

And finally, here are the results for the wine dataset, using malic acid for the x-axis and alcohol for the y-axis.

I tend to think I under-reacted a bit to the spike of data with x around 0.2 and large y values. I thought at the time that it didn’t make sense to have a non-monotonic relationship between malic acid and alcohol. However, in retrospect it could easily be real, e.g. because it’s a cluster of one type of wine.

Summary of results

Here’s a summary of the RMSE for all datasets.

Method            Boston   Diabetes   Iris   Wine
Ridge             .178     .227       .189   .211
LOWESS            .178     .229       .182   .212
Gaussian Process  .177     .226       .184   .204
Random Forests    .192     .226       .192   .200
Neural Nets       .177     .225       .185   .211
K-NN              .178     .232       .186   .202
justin            .178     .230       .181   .204

And here’s a summary of the MAE

Method            Boston   Diabetes   Iris   Wine
Ridge             .133     .191       .150   .180
LOWESS            .134     .194       .136   .180
Gaussian Process  .131     .190       .139   .170
Random Forests    .136     .190       .139   .162
Neural Nets       .131     .190       .139   .179
K-NN              .129     .196       .137   .165
justin            .121     .194       .138   .171

Honestly, I’m a little surprised how well I did here—I expected that I’d do OK but that some algorithm (probably LOWESS, still inexplicably not in base scikit-learn) would win in most cases.

I’ve been doing machine learning for years, but I’ve never run a "human regression ensemble" before. With practice, I’m sure I’d get better at drawing these lines, but I’m not going to get any better at applying machine learning methods.

I didn’t do anything particularly clever in setting up these machine learning methods, but it wasn’t entirely trivial (see below). A random person in the world is probably more likely than I was to make a mistake when running a machine learning method, but just as good at drawing curves. Drawing curves is an extremely robust way to predict.

What’s the point of this? It’s just that machine learning isn’t magic. For simple problems, it doesn’t fundamentally give you anything better than you can get just from common sense.

Machine learning is still useful, of course. For one thing, it can be automated. (Drawing many curves is tedious…) And with much larger datasets, machine learning will—I assume—beat any manual predictions. The point is just that in those cases it’s an elaboration on common sense, not some magical pixie dust.

Details on the regression methods

Here were the machine learning algorithms I used:

  1. Ridge: Linear regression with squared l2-norm regularization.
  2. LOWESS: Locally-weighted regression.
  3. GPR: Gaussian-process regression with an RBF kernel.
  4. RF: Random forests.
  5. MLP: A single hidden-layer neural network / multi-layer perceptron with tanh nonlinearities, optimized by (non-stochastic) L-BFGS with 50,000 iterations.
  6. KNN: K-nearest neighbors.

For all the methods other than Gaussian processes, I used 5-fold cross-validation to tune the key hyperparameter. The options I used were:

  1. Ridge: Regularization penalty of λ=.001, .01, .1, 1, or 10.
  2. LOWESS: Bandwidth of σ=.001, .01, .1, 1, or 10.
  3. Random forests: Minimum samples in each leaf of n=1,2,…,19.
  4. Multi-layer perceptrons: 1, 5, 10, 20, 50, or 100 hidden units, with α=.01 regularization.
  5. K-nearest neighbors: K=1,2,…,19 neighbors.
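To make the selection procedure concrete, here’s a minimal numpy-only sketch of how 5-fold cross-validation can pick the ridge penalty from the grid above. This is my own illustration, not the original experiment code; the function names are mine.

```python
import numpy as np

def ridge_cv(X, y, lams=(0.001, 0.01, 0.1, 1.0, 10.0), n_folds=5, seed=0):
    """Pick the ridge penalty by n-fold cross-validation on RMSE."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)

    def fit(Xtr, ytr, lam):
        # Closed-form ridge solution: (X'X + lam*I)^-1 X'y
        d = Xtr.shape[1]
        return np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(d), Xtr.T @ ytr)

    scores = []
    for lam in lams:
        errs = []
        for i in range(n_folds):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            w = fit(X[train], y[train], lam)
            errs.append(np.sqrt(np.mean((X[test] @ w - y[test]) ** 2)))
        scores.append(np.mean(errs))
    return lams[int(np.argmin(scores))]
```

On noiseless linear data this picks the smallest penalty in the grid, as you’d expect; on noisy data the chosen penalty grows.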

For Gaussian processes, I did not use cross-validation, but rather scikit-learn’s built-in hyperparameter optimization. In particular, I used the magical incantation

kernel = ConstantKernel(1.0, (.1, 10)) + ConstantKernel(1.0, (.1, 10)) * RBF(10, (.1, 100)) + WhiteKernel(5, (.5, 50))

which I understand means the system optimizes the kernel parameters to maximize the marginal likelihood.

Calculating variance in the log domain

Say you’ve got a positive dataset and you want to calculate the variance. However, the numbers in your dataset are huge, so huge you need to represent them in the log-domain. How do you compute the log-variance without things blowing up?

I ran into the problem today. To my surprise, I couldn’t find a standard solution.

The bad solution

Suppose that your data is x_1 \cdots x_n, which you have stored as l_1 \cdots l_n where x_i = \exp(l_i). The obvious thing to do is to just exponentiate and then compute the variance. That would be something like the following:

\text{For all } i\in\{1,\dots,n\}, \text{ set } x_i = \exp(l_i)

\text{var} = \frac{1}{n} \sum_{i=1}^n (x_i - \frac{1}{n} \sum_{j=1}^n x_j)^2

This of course is a terrible idea: When l_i is large, you can’t even write down x_i without running into numerical problems.
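To see how quickly this fails, here’s a tiny numpy demonstration (the values are chosen only for illustration; the largest double is about 1.8×10³⁰⁸, i.e. exp(709.78)):

```python
import numpy as np

# Even modest log-domain values overflow float64 when exponentiated.
logx = np.array([710.0, 711.0, 712.0])
with np.errstate(over='ignore'):
    x = np.exp(logx)
print(x)  # all inf, so the "obvious" variance computation is useless
```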

The mediocre solution

The first idea I had for this problem was relatively elegant. We can of course represent the variance as

\text{var} = \bar{x^2} - \bar{x}^2.

Instead of calculating \bar{x} and \bar{x^2}, why not calculate the log of these quantities?

To do this, we can introduce a “log domain mean” operator, a close relative of the good-old scipy.special.logsumexp

import numpy as np

def log_domain_mean(logx):
    "np.log(np.mean(np.exp(logx))) but more stable"
    n = len(logx)
    damax = np.max(logx)
    return np.log(np.sum(np.exp(logx - damax))) \
        + damax - np.log(n)

Next, introduce a “log-sub-add” operator. (A variant of np.logaddexp)

def logsubadd(a,b):
    "np.log(np.exp(a)-np.exp(b)) but more stable"
    return a + np.log(1-np.exp(b-a))

Then, we can compute the log-variance as

def log_domain_var(logx):
    a = log_domain_mean(2*logx)
    b = log_domain_mean(logx)*2
    c = logsubadd(a,b)
    return c

Here a is \log \bar{x^2} while b is \log \bar{x}^2.

Nice, right? Well, it’s much better than the first solution. But it isn’t that good. The problem is that when the variance is small, a and b are close. When they are both close and large, logsubadd runs into numerical problems. It’s not clear that there is a way to fix this problem with logsubadd.

To solve this, abandon elegance!

The good solution

For the good solution, the math is a series of not-too-intuitive transformations. (I put them at the end.) These start with

\text{var} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2

and end with

\log \text{var} = \log \sum_{i=1}^n (\exp (l_i - \log \bar{x}) - 1)^2 + 2 \log \bar{x} - \log(n).

Why this form? Well, we’ve reduced to things we can do relatively stably: Compute the log-mean, and do a (small variant of) log-sum-exp.

def log_domain_var(logx):
    """like np.log(np.var(np.exp(logx)))
    except more stable"""
    n = len(logx)
    log_xmean = log_domain_mean(logx)
    return np.log(np.sum(np.expm1(logx - log_xmean)**2)) \
        + 2*log_xmean - np.log(n)

This uses the log_domain_mean implementation from above, and also np.expm1 to compute \exp(a)-1 in a more stable way when a is close to zero.

Why is this stable? Is it really stable? Well, umm, I’m not sure. I derived transformations that “looked stable” to me, but there’s no proof that this is best. I’d be surprised if a better solution wasn’t possible. (I’d also be surprised if there isn’t a paper from 25+ years ago that describes that better solution.)

In any case, I’ve experimentally found that this function will (while working in single precision) happily compute the variance even when logx is in the range of -10^{30} to 10^{30}, which is about 28 orders of magnitude better than the naive solution and sufficient for my needs.
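Putting both helpers together, here’s a quick sanity check (the test values here are mine): the stable version agrees with the naive formula in a regime where both work, and still returns a finite answer where the naive one overflows.

```python
import numpy as np

def log_domain_mean(logx):
    "np.log(np.mean(np.exp(logx))) but more stable"
    n = len(logx)
    damax = np.max(logx)
    return np.log(np.sum(np.exp(logx - damax))) + damax - np.log(n)

def log_domain_var(logx):
    "like np.log(np.var(np.exp(logx))) except more stable"
    n = len(logx)
    log_xmean = log_domain_mean(logx)
    return np.log(np.sum(np.expm1(logx - log_xmean)**2)) + 2*log_xmean - np.log(n)

# Agrees with the naive computation on moderate values...
logx = np.log(np.array([1.0, 2.0, 3.0]))
print(log_domain_var(logx), np.log(np.var(np.array([1.0, 2.0, 3.0]))))

# ...and stays finite where exponentiating would overflow.
print(log_domain_var(np.array([1e5, 1e5 + 1.0, 1e5 + 2.0])))
```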

As always, failure cases are probably out there. Numerical instability always wins when it can be bothered to make an effort.

Appendix: The transformations

\text{var} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2

\text{var} = \frac{1}{n} \sum_{i=1}^n (\exp (l_i) - \bar{x})^2

\text{var} = \frac{1}{n} \sum_{i=1}^n (\exp (l_i) / \bar{x} - 1)^2 \bar{x}^2

\text{var} = \frac{1}{n} \sum_{i=1}^n (\exp (l_i - \log \bar{x}) - 1)^2 \bar{x}^2

\log \text{var} = \log \sum_{i=1}^n (\exp (l_i - \log \bar{x}) - 1)^2 + 2 \log \bar{x} - \log(n)

Back to Basics: Approximate Bayesian Computation

(All based on these excellent slides from Umberto Picchini)

“Approximate Bayesian Computation” sounds like a broad class of methods that would potentially include things like message passing, variational methods, MCMC, etc. However, for historical reasons, the term is used for a very specialized class of methods.

The core idea is as follows:

Sample from the posterior using rejection sampling, with the accept/reject decision made by generating a synthetic dataset and comparing it to the observed one.

The basic idea

Take a model {p(z,x)}. Assume we observe some fixed {x} and want to sample from {p(z|x).} Assume {x} is discrete.

Algorithm (Basic ABC):

  • Sample {(z^{*},x^{*})\sim p(z,x)}
  • If {x^{*}=x}, return {z^{*}}
  • Else, repeat.

Claim: This algorithm returns an exact sample from the posterior {p(z|x).}

Proof: The probability of returning {z^{*}} is the probability of (i) drawing {z^{*}} from the prior, and (ii) sampling {x} conditional on {z}. Thus

\displaystyle  \begin{array}{rcl}  \text{probability of returning }z^{*} & = & p(z^{*})p(x\vert z^{*})\\ & = & p(z^{*},x)\\ & = & p(x)p(z^{*}\vert x). \end{array}

The probability of returning {z^{*}} in any one iteration is the posterior times the constant {p(x).} So this gives exact samples.
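As an illustration (my own toy example, not from the slides), here is the basic ABC loop for a coin whose bias z is drawn from a two-point prior, with x the number of heads in 5 flips:

```python
import random

def abc_sample(x_obs, n_flips=5):
    """Basic ABC: sample (z*, x*) from the joint p(z, x),
    and return z* only when the synthetic data exactly matches x_obs."""
    while True:
        z = random.choice([0.2, 0.8])                         # z* ~ p(z)
        x = sum(random.random() < z for _ in range(n_flips))  # x* ~ p(x | z*)
        if x == x_obs:
            return z

# Observing all heads makes z = 0.8 overwhelmingly likely a posteriori.
random.seed(0)
samples = [abc_sample(5) for _ in range(1000)]
print(samples.count(0.8) / len(samples))  # close to 1
```

Note that only simulation is used: the likelihood p(x|z) is never evaluated, matching the "minimal assumptions" point below.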

Why this is interesting.

There’s two special properties:

  • It gives exact samples from the posterior. For so many Bayesian methods, it’s hard to know if you have a good answer or not. Here, if the algorithm successfully returns anything, you have an exact sample.
  • It works under minimal assumptions about the target model. All that’s needed is (i) to be able to simulate p(z), and (ii) to be able to simulate p(x|z). You don’t even need to be able to evaluate either of these!

Why this works badly in general.

Of course, the problem is that you’ll typically be unlikely to exactly hit {x}. Formally speaking, the probability of returning anything in a given loop is

\displaystyle  \begin{array}{rcl}  \sum_{z^{*}}\text{probability of returning }z^{*} & = & p(x). \end{array}

In high dimensions, {p(x)} will typically be small unless you tend to get the same data regardless of the value of the latent variable. (In which case is the problem even interesting?)

It’s just rejection sampling

This is almost exactly rejection sampling. Remember that in general if you want to sample from {P(z)} you need a proposal distribution {Q(z)} you can sample from, and you need to know a constant {M} such that {\frac{P(z)}{M}\leq Q(z).} The above algorithm is just using

\displaystyle  \begin{array}{rcl}  P(z) & = & p(z|x^{*})\\ Q(z) & = & p(z)\\ M & = & \frac{1}{p(x^{*})}. \end{array}

Then, {Q(z)} is a valid proposal since {P(z)/M\leq Q(z)} is equivalent to {p(z,x^{*})\leq p(z)} which is always true.

Why isn’t this exactly rejection sampling? In traditional descriptions of rejection sampling, you’d need to calculate {P(z)} and {M}. In the ABC setting we can’t calculate either of these, but we exploit that we can calculate the ratio {P(z)/M.}

Adding an {\epsilon}

To increase the chance of accepting (or to make the algorithm work at all if {x} is continuous) you need to add a “slop factor” {\epsilon}. You change the algorithm to instead accept if \displaystyle  d(x,x^{*})\leq\epsilon for some small {\epsilon}. The value of {\epsilon} introduces an accuracy / computation tradeoff. However, this doesn’t solve the fundamental problem: things still don’t scale well to high dimensions.

Summary statistics

Another idea to reduce expense is to instead compare summary statistics. That is, find some function {s(x)} and accept if \displaystyle  s(x)=s(x^{*}) rather than if \displaystyle x=x^* as before.

If we make this change, then the probability of returning \displaystyle z^* in any one iteration is

\displaystyle  \begin{array}{rcl}  \text{probability of returning }z^{*} & = & p(z^{*})\sum_{x}p(x\vert z^{*})I[s(x)=s(x^{*})]\\ & = & p(z^{*})p(s(x)\vert z^{*})\\ & = & p(s(x))p(z^{*}\vert s(x)). \end{array}

Above we define {p(s'\vert z)=\sum_{x}p(x\vert z)I[s'=s(x)]} and {p(s')=\sum_{x}p(x)I[s(x)=s'].}

The probability of returning anything in a given round is \displaystyle  \begin{array}{rcl}  \sum_{z^{*}}\text{probability of returning }z^{*} & = & p(s(x)). \end{array}

There’s good news and bad news about making this change.

Good news:

  • We have improved the acceptance rate from {p(x)} to {p(s(x)).} This could be much higher if there are many different datasets that yield the same summary statistics.

Bad news:

  • This introduces errors in general. To avoid introducing error, we need that {p(z\vert s(x))=p(z\vert x).}

Exponential family

Often, summary statistics are used even though they introduce errors. It seems to be a bit of a craft to find good summary statistics to speed things up without introducing too much error.

There is one appealing case where no error is introduced. Suppose {p(x|z)} is in the exponential family and {s(x)} are the sufficient statistics for that family. Then, we know that {p(z|x)=p(z\vert s(x))}. This is very nice.

Slop factors

If you’re using a slop factor, you can instead accept according to \displaystyle  d(s(x),s(x^{*}))\leq\epsilon.
This introduces the same kind of computation / accuracy tradeoff.

ABC-MCMC (Or likelihood-free MCMC)

Before getting to ABC-MCMC, suppose we just wanted, for some reason, to use Metropolis-Hastings to sample from the prior {p(z)}. If our proposal distribution was q(\tilde{z}\vert z) then we’d do

Algorithm: (Regular old Metropolis-Hastings)

  • Initialize {z} somehow
  • Repeat:
    • Propose {\tilde{z}\sim q(\tilde{z}\vert z)} from some proposal distribution
    • Compute acceptance probability \displaystyle {\alpha\leftarrow\frac{p(\tilde{z})}{p(z)}\frac{q(z\vert\tilde{z})}{q(\tilde{z}\vert z)}}.
    • Generate {r\sim U(0,1).}
    • If {r\leq\alpha} then {z\leftarrow\tilde{z}}.

Now suppose we want to sample from the posterior {p(z|x)} instead. We suggest the following modified algorithm.

Algorithm: (ABC-MCMC)

  • Initialize {z} somehow and initialize {x^{*}=x.}
  • Repeat:
    • Propose {\tilde{z}\sim q(\tilde{z}\vert z)}.
    • Sample a synthetic dataset {\tilde{x}^{*}\sim p(\tilde{x}^{*}\vert\tilde{z})}.
    • Compute acceptance probability {{\displaystyle \alpha\leftarrow\frac{p(\tilde{z})}{p(z)}\frac{q(z\vert\tilde{z})}{q(\tilde{z}\vert z)}}} {{\displaystyle I[\tilde{x}^{*}=x^{*}]}}.
    • Generate {r\sim U(0,1).}
    • If {r\leq\alpha} then {z\leftarrow\tilde{z}}.

There is only one difference: After proposing {\tilde{z}}, we generate a synthetic dataset. We can accept only if the synthetic dataset is the same as the observed one.
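Continuing the toy coin example (my own illustration): with a uniform prior on z in (0,1) and a symmetric random-walk proposal, both the prior ratio and the proposal ratio cancel, and the acceptance test reduces to the indicator:

```python
import random

def abc_mcmc(x_obs, n_flips=5, iters=4000, step=0.1):
    """ABC-MCMC sketch: uniform prior on z in (0,1), symmetric Gaussian
    random-walk proposal, so alpha reduces to I[x_synthetic == x_obs]."""
    z = 0.5
    chain = []
    for _ in range(iters):
        z_prop = z + random.gauss(0.0, step)
        if 0.0 < z_prop < 1.0:  # prior ratio: 1 inside the support, 0 outside
            x_syn = sum(random.random() < z_prop for _ in range(n_flips))
            if x_syn == x_obs:  # the indicator I[x* = x]
                z = z_prop
        chain.append(z)
    return chain

random.seed(1)
chain = abc_mcmc(5)  # observed: 5 heads out of 5
print(sum(chain[2000:]) / 2000)  # roughly the posterior mean, 6/7 ≈ 0.86
```

Here the exact posterior is Beta(6,1), so the chain average gives a sanity check on the sampler.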

What this solves

There are two computational problems that the original ABC algorithm can encounter:

  • The prior {p(z)} may be a terrible proposal distribution for the posterior {p(z|x)}. Maybe random samples from the prior almost never yield datasets similar to the observed.
  • Even with a good proposal for {z}, the acceptance probability {p(x)} might be very low.

The MCMC-ABC algorithm seems intended to deal with the first problem: If the proposal distribution {q(\tilde{z}\vert z)} only yields nearby points, then once the typical set has been reached, the probability of proposing a “good” {\tilde{z}} is much higher.

On the other hand, the MCMC-ABC algorithm seems to do little to address the second problem.


Now, why is this a correct algorithm? Consider the augmented distribution \displaystyle  p(z,x^{*},x)=p(z,x^{*})I[x^{*}=x].

We now want to sample from {p(z,x^{*}\vert x)} using Metropolis-Hastings. We choose the proposal distribution

\displaystyle  q(\tilde{z},\tilde{x}^{*}\vert z,x^{*})=q(\tilde{z}\vert z)p(\tilde{x}^{*}\vert\tilde{z}).

The acceptance probability will then be

\displaystyle  \begin{array}{rcl}  \alpha & = & \min\left(1,\frac{p(\tilde{z},\tilde{x}^{*},x)}{p(z,x^{*},x)}\frac{q(z,x^{*}\vert\tilde{z},\tilde{x}^{*})}{q(\tilde{z},\tilde{x}^{*}\vert z,x^{*})}\right)\\ & = & \min\left(1,\frac{p(\tilde{z},\tilde{x}^{*})I[\tilde{x}^{*}=x]}{p(z,x^{*})I[x^{*}=x]}\frac{q(z\vert\tilde{z})p(x^{*}\vert z)}{q(\tilde{z}\vert z)p(\tilde{x}^{*}\vert\tilde{z})}\right) \end{array} .

Since the original state {(z,x^{*})} was accepted, it must be true that {x^{*}=x.} Thus, the above can be simplified into

\displaystyle  \begin{array}{rcl}  \alpha & = & \min\left(1,\frac{p(\tilde{z})}{p(z)}\frac{q(z\vert\tilde{z})}{q(\tilde{z}\vert z)}\right)I[\tilde{x}^{*}=x]. \end{array}


If using summary statistics, you change {I[\tilde{x}^{*}=x]} into {I[s(\tilde{x}^{*})=s(x)].} You can also add a slop factor if you want.

More generally still, we could instead use the augmented distribution

\displaystyle  p(z,x^{*},x)=p(z,x^{*})p(x\vert x^{*}).

The proposal {p(x\vert x^{*})} can be something interesting like a Multivariate Gaussian. The acceptance probability then instead becomes

\displaystyle  \begin{array}{rcl}  \alpha & = & \min\left(1,\frac{p(\tilde{z})p(x\vert\tilde{x}^{*})}{p(z)p(x\vert x^{*})}\frac{q(z\vert\tilde{z})}{q(\tilde{z}\vert z)}\right). \end{array}

Of course, this introduces some error.

Choosing {\epsilon}

In practice, a value of {\epsilon} that works well at the end will lead to very slow progress at the beginning, so it’s best to slowly reduce {\epsilon} over time. Shooting for an acceptance rate of around 1% at the end seems like a good compromise: a higher acceptance rate would mean that too much error was introduced.

(Thanks to Javier Burroni for helpful comments.)

Three opinions on theorems

1. Think of theorem statements like an API.

Some people feel intimidated by the prospect of putting a “theorem” into their papers. They feel that their results aren’t “deep” enough to justify this. Instead, they give the derivation and result inline as part of the normal text.

Sometimes that’s best. However, the purpose of a theorem is not to shout to the world that you’ve discovered something incredible. Rather, theorems are best thought of as an “API for ideas”. There are two basic purposes:

  • To separate what you are claiming from your argument for that claim.
  • To provide modularity to make it easier to understand or re-use your ideas.

To decide if you should create a theorem, ask if these goals will be advanced by doing so.

A thoughtful API makes software easier to use: The goal is to allow the user as much functionality as possible with as simple an interface as possible, while isolating implementation details. If you have a long chain of mathematical argument, you should choose what parts to write as theorems/lemmas in much the same way.

Many papers intermingle definitions, assumptions, proof arguments, and the final results. Have pity on the reader, and tell them in a single place what you are claiming, and under what assumptions. The “proof” section separates your evidence for your claim from the claim itself. Most readers want to understand your result before looking at the proof. Let them. Don’t make them hunt to figure out what your final result is.

Perhaps controversially, I suggest you should use the above reasoning even if you aren’t being fully mathematically rigorous. It’s still kinder to the reader to state your assumptions informally.

As an example of why it’s helpful to explicitly state your results, here’s an excerpt from a seminal paper, so I’m sure the authors don’t mind.


This proof is well written. The problem is the many small uncertainties that accumulate as you read it. If you try to understand exactly:

  • What result is being stated?
  • Under what assumptions does that result hold?

You will find that the proof “bleeds in” to the result itself. The convergence rate in 2.13 involves Q(\theta) defined in 2.10, which itself involves other assumptions tracing backwards in the paper.

Of course, not every single claim needs to be written as a theorem/lemma/claim. If your result is simple to state and will only be used in a “one-off” manner, it may be clearer just to leave it in the text. That’s analogous to “inlining” a small function instead of creating another one.

2. Don’t fear the giant equation block.

I sometimes see a proof like this (for 0<x<1):

Take the quantity

\frac{x-x^2}{\sqrt{x^2-2x+1}}.
Pulling out x this becomes

x \frac{1-x}{\sqrt{x^2-2x+1}}.

Factoring the denominator, this is

x \frac{1-x}{\sqrt{(x-1)(x-1)}}.


For some proofs, the text between each line just isn’t that helpful. To a large degree it makes things more confusing: without an equality between the lines, you need to read the words to understand how each formula relates to the previous one. Consider this alternative version of the proof:

\begin{aligned} \frac{x-x^2}{\sqrt{x^2-2x+1}}  = & x \frac{1-x}{\sqrt{x^2-2x+1}} \\ = & x \frac{1-x}{\sqrt{(x-1)(x-1)}} \\ = & x \frac{1-x}{\sqrt{(x-1)^2}} \\ = & x \frac{1-x}{1-x} \\ = & x \\ \end{aligned}

In some cases, this reveals the overall structure of the proof better than a bunch of lines with interspersed text. If explanation is needed, it can be better to put it at the end, e.g. as “where line 2 follows from [blah] and line 3 follows from [blah]”.

It can also be helpful to put these explanations inline, i.e. to use a proof like

\begin{aligned} \frac{x-x^2}{\sqrt{x^2-2x+1}} = & x \frac{1-x}{\sqrt{x^2-2x+1}} & \text{ pull out x} \\ = & x \frac{1-x}{\sqrt{(x-1)(x-1)}} & \text{ factoring denominator} \\ = & x \frac{1-x}{\sqrt{(x-1)^2}} & \\ = & x \frac{1-x}{1-x} & \text{ since denominator is positive} \\ = & x & \\ \end{aligned}

Again, this is not the best solution for all (or even most) cases, but I think it should be used more often than it is.

3. Use equivalence of inequalities when possible.

Many proofs involve manipulating chains of inequalities. When doing so, it should be obvious at what steps extra looseness may have been introduced. Suppose you have some positive constants a and c with a^2>c and you need to choose b so as to ensure that b^2+c \leq a^2.

People will often prove a result like the following:

Lemma: If b \leq \sqrt{a^2-c}, then b^2+c \leq a^2.

Proof: Under the stated condition, we have that

\begin{aligned} b^2 + c & \leq & (\sqrt{a^2-c})^2+c \\ & = & a^2-c+c \\ & = & a^2 \end{aligned}

That’s all correct, but doesn’t something feel slightly “magical” about the proof?

There are two problems: First, the proof reveals nothing about how you came up with the final answer. Second, the result leaves it ambiguous whether you have introduced additional looseness. Given the starting assumption, is it possible to prove a stronger bound?

I think the following lemma and proof are much better:

Lemma: b^2+c \leq a^2 if and only if b \leq \sqrt{a^2-c}.

Proof: The following conditions are all equivalent:

\begin{aligned} b^2+c & \leq & a^2 \\ b^2 & \leq & a^2-c \\ b & \leq & \sqrt{a^2-c}. \\ \end{aligned}

The proof shows exactly how you arrived at the final result, and shows that there is no extra looseness. It’s better not to “pull a rabbit out of a hat” in a proof if not necessary.

This is arguably one of the most basic possible proof techniques, but is bizarrely underused. I think there’s two reasons why:

  1. Whatever need motivated the lemma is probably met by the first one above. The benefit of the second is mostly in providing more insight.
  2. Mathematical notation doesn’t encourage it. The sentence at the beginning of the proof is essential. If you see this merely as a series of inequalities, each implied by the one before, then it will not give the “only if” part of the result. You could conceivably try to write something like a < b \Leftrightarrow \exp a < \exp b, but this is awkward with multiple lines.

Exponential Families Cheatsheet

As part of the graphical models course I taught last spring, I developed a “cheatsheet” for exponential families. The basic purpose is to explain the standard moment-matching condition of maximum likelihood. The goal of the sheet was to clearly show how this property generalized to maximum likelihood in conditional exponential families, with hidden variables, or both. It’s available as an image below, or as a PDF here. Please let me know about any errors!



I use the (surprisingly controversial) convention of using a sans-serif font for random variables, rather than capital letters. I’m convinced this is the least-bad option for the machine learning literature, where many readers seem to find capital letter random variables distracting. It also allows you to distinguish matrix-valued random variables, though that isn’t used here.

Statistics – the rules of the game

What is statistics about, really? It’s easy to go through a class and get the impression that it’s about manipulating intimidating formulas. But what’s the goal of all those formulas? Why did people invent them?

If you zoom out, the big picture is more conceptual than mathematical. Statistics has a crazy, grasping ambition: it wants to tell you how to best use observations to make decisions. For example, you might look at how much it rained each day in the last week, and decide if you should bring an umbrella today. Statistics converts data into ideal actions.

Here, I’ll try to explain this view. I think it’s possible to be quite precise about this while using almost no statistics background and extremely minimal math.

The two important characters that we meet are decision rules and loss functions. Informally, a decision rule is just some procedure that looks at a dataset and makes a choice. A loss function (a basic concept from decision theory) is a precise description of “how bad” a given choice is.

Model Problem: Coinflips

Let’s say you’re confronted with a coin where the odds of heads and tails are not known ahead of time. Still, you are allowed to observe how the coin performs over a number of flips. After that, you’ll need to make a “decision” about the coin. Explicitly:

  • You’ve got a coin, which comes up heads with probability w. You don’t know w.
  • You flip the coin n times.
  • You see k heads and n-k tails.
  • You do something, depending on k. (We’ll come back to this.)

Simple enough, right? Remember, k is the total number of heads after n flips. If you do some math, you can work out a formula for p(k\vert w,n): the probability of seeing exactly k heads. For our purposes, it doesn’t really matter what that formula is, just that it exists. It’s known as a Binomial distribution, and so is sometimes written \mathrm{Binomial}(k\vert n,w).
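For completeness, the formula is easy to write down with the standard-library `math.comb` (my own sketch, not code from the post):

```python
from math import comb

def binom_pmf(k, n, w):
    """p(k | w, n): probability of exactly k heads in n flips with bias w."""
    return comb(n, k) * w**k * (1 - w)**(n - k)

# The probabilities over k = 0..n sum to one, and with n=21, w=0.2
# the most likely count is k=4.
print(binom_pmf(4, 21, 0.2))
```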

Here’s an example of what this looks like with n=21 and w=0.5.


Naturally enough, if w=.5, with 21 flips, you tend to see around 10-11 heads. Here’s an example with w=0.2. Here, the most common value is 4, close to 21\times.2=4.2.


Decisions, decisions

After observing some coin flips, what do we do next? You can imagine facing various possible situations, but we will use the following:

Our situation: After observing n coin flips, you need to guess “heads” or “tails”, for one final coin flip.

Here, you just need to “decide” what the next flip will be. You could face many other decisions, e.g. guessing the true value of w.

Now, suppose that you have a friend who seems very skilled at predicting the final coinflip. What information would you need to reproduce your friend’s skill? All you need to know is if your friend predicts heads or tails for each possible value of k. We think of this as a decision rule, which we write abstractly as

\mathrm{Dec}(k).
This is just a function of one integer k. You can think of this as just a list of what guess to make, for each possible observation, for example:

k        \mathrm{Dec}(k)
0        \mathrm{tails}
1        \mathrm{heads}
2        \mathrm{heads}
\vdots   \vdots
n        \mathrm{tails}

One simple decision rule would be to just predict heads if you saw more heads than tails, i.e. to use

\mathrm{Dec}(k)=\begin{cases} \mathrm{heads}, & k\geq n/2 \\ \mathrm{tails} & k<n/2 \end{cases}.

The goal of statistics is to find the best decision rule, or at least a good one. The rule above is intuitive, but not necessarily the best. And… wait a second… what does it even mean for one decision rule to be “better” than another?

Our goal: minimize the thing that’s designed to be minimized

What happens after you make a prediction? Consider our running example. There are many possibilities, but here are two of the simplest:

  • Loss A: If you predicted wrong, you lose a dollar. If you predicted correctly, nothing happens.
  • Loss B: If you predict “tails” and “heads” comes up, you lose 10 dollars. If you predict “heads” and “tails” comes up, you lose 1 dollar. If you predict correctly, nothing happens.

We abstract these through a concept of a loss function. We write this as

L(w,d).
The first input is the true (unknown) value w, while second input is the “prediction” you made. We want the loss to be small.

Now, one point might be confusing. We defined our situation as predicting the next coinflip, but now L is defined comparing d to w, not to a new coinflip. We do this because comparing to w gives the most generality. To deal with our situation, just use the average amount of money you’d lose if the true value of the coin were w. Take loss A. If you predict “tails”, you’ll be wrong with probability w, while if you predict “heads”, you’ll be wrong with probability 1-w, and so lose 1-w dollars on average. This leads to the loss

L_{A}(w,d)=\begin{cases} w & d=\mathrm{tails}\\ 1-w & d=\mathrm{heads} \end{cases}.

For loss B, the situation is slightly different, in that you lose 10 times as much in the first case. Thus, the loss is

L_{B}(w,d)=\begin{cases} 10w & d=\mathrm{tails}\\ 1-w & d=\mathrm{heads} \end{cases}.
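In code, the two losses above are just small functions of (w, d); the helper names here are mine:

```python
def loss_A(w, d):
    # Expected loss of one dollar per wrong prediction:
    # predicting tails is wrong with probability w, heads with 1-w.
    return w if d == "tails" else 1 - w

def loss_B(w, d):
    # Same, except predicting tails when heads comes up costs 10 dollars.
    return 10 * w if d == "tails" else 1 - w
```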

The definition of a loss function might feel circular: we minimize the loss because we defined the loss as the thing that we want to minimize. What’s going on? Well, a statistical problem has two separate parts: a model of the data generating process, and a loss function describing your goals. Neither of these things is determined by the other.

So, the loss function is part of the problem. Statistics wants to give you what you want. But you need to tell statistics what that is.

Despite the name, a “loss” can be negative; you still just want to minimize it. Machine learning, always optimistic, favors “reward” functions that are to be maximized. Plus ça change.

Model + Loss = Risk

OK! So, we’ve got a model of our data generating process, and we specified some loss function. For a given w, we know the distribution over k, so… I guess… we want to minimize it?

Let’s define the risk to be the average loss that a decision rule gives for a particular value of w. That is,

R(w,\mathrm{Dec})=\sum_{k}p(k\vert w,n)L(w,\mathrm{Dec}(k)).

Here, the second input to R is a decision rule– a precise recipe of what decision to make in each possible situation.

Let’s visualize this. As a set of possible decision rules, I will just consider rules that predict “heads” if they’ve seen at least m heads, and “tails” otherwise:

\mathrm{Dec}_{m}(k)=\begin{cases} \mathrm{heads} & k\geq m\\ \mathrm{tails} & k<m \end{cases}.

With n=21 there are 23 such decision rules, corresponding to m=0 (always predict heads), m=1 (predict heads if you see at least one heads), up to m=22 (always predict tails). These are shown here:

These rules are intuitive: if you’d predict heads after observing 16 heads out of 21, it would be odd to predict tails after seeing 17 instead! It’s true that for losses L_{A} and L_{B}, you don’t lose anything by restricting to this kind of decision rule. However, there are losses for which these decision rules are not enough. (Imagine you lose more when your guess is correct.)
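To make the risk formula concrete, here’s a small sketch (my own code) that evaluates R(w, Dec_m) under loss A, by summing the loss of each decision weighted by the binomial probabilities:

```python
from math import comb

def binom_pmf(k, n, w):
    # p(k | w, n): probability of exactly k heads in n flips
    return comb(n, k) * w**k * (1 - w)**(n - k)

def risk_A(w, m, n=21):
    """R(w, Dec_m) under loss A, where Dec_m predicts heads iff k >= m."""
    total = 0.0
    for k in range(n + 1):
        decision = "heads" if k >= m else "tails"
        loss = (1 - w) if decision == "heads" else w
        total += binom_pmf(k, n, w) * loss
    return total

# Sanity check: always predicting heads (m=0) risks 1-w;
# always predicting tails (m=22) risks w.
print(risk_A(0.2, 0), risk_A(0.2, 22))
```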

With those decision rules in place, we can visualize what risk looks like. Here, I fix w=0.2, and I sweep through all the decision rules (by changing m) with loss L_{A}:


The value R_A in the bottom plot is the total area of the green bars in the middle. You can do the same sweep for w=0.4, which is pictured here:


We can visualize the risk in one figure with various w and m. Notice that the curves for w=0.2 and w=0.4 are exactly the same as we saw above.


Of course, we get a different risk depending on what loss function we use. If we repeat the whole process using loss L_{B} we get the following:


Dealing with risk

What’s the point of risk? It tells us how good a decision rule is. We want a decision rule where risk is as low as possible. So you might ask, why not just choose the decision rule that minimizes R(w,\mathrm{Dec})?

The answer is: because we don’t know w! How do we deal with that? Believe it or not, there isn’t a single agreed-upon “right” thing to do, and so we meet two different schools of thought.

Option 1 : All probability all the time

Bayesian statistics (don’t ask about the name) defines a “prior” distribution p(w) over w. This says which values of w we think are more and less likely. Then, we define the Bayesian risk as the average of R over the prior:

R_{\mathrm{Bayes}}(\mathrm{Dec})=\int_{w=0}^{1}p(w)R(w,\mathrm{Dec})dw.
This just amounts to “averaging” over all the risk curves, weighted by how “probable” we think w is. Here’s the Bayes risk corresponding to L_{A} with a uniform prior p(w)=1:


For reference, the risk curves R(w,\mathrm{Dec}_m) are shown in light grey. Naturally enough, for each value of m, the Bayes risk is just the average of the regular risks for each w.
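Numerically, this averaging is a one-line integral. Here's a sketch that approximates R_{\mathrm{Bayes}}(\mathrm{Dec}_m) with a midpoint rule under the uniform prior p(w)=1; the loss is again a stand-in for L_{A} (an assumption): 1-w for predicting heads, w for tails.

```python
from math import comb

n = 21

def pmf(k, w):
    # p(k | w, n): binomial probability of k heads in n flips
    return comb(n, k) * w**k * (1 - w) ** (n - k)

def loss(w, d):
    # stand-in for L_A (an assumption)
    return 1 - w if d == "heads" else w

def risk(w, m):
    # R(w, Dec_m), where Dec_m predicts heads iff k >= m
    return sum(pmf(k, w) * loss(w, "heads" if k >= m else "tails")
               for k in range(n + 1))

def bayes_risk(m, grid=1000):
    # R_Bayes(Dec_m) = integral of p(w) R(w, Dec_m) dw with p(w) = 1,
    # approximated by a midpoint rule on `grid` subintervals
    return sum(risk((i + 0.5) / grid, m) for i in range(grid)) / grid

# the best rule within the Dec_m family, by Bayes risk
best_m = min(range(n + 2), key=bayes_risk)
```

Under this stand-in loss, the always-heads rule m=0 has Bayes risk 1/2, and the best m sits near the middle of the family.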

Here’s the risk corresponding to L_{B}:


That’s all quite natural. But we haven’t really searched through all the decision rules, only the simple ones \mathrm{Dec}_m. For other losses, these simple ones might not be enough, and there are a lot of decision rules. (Even for this toy problem there are 2^{22}, since you can output heads or tails for each of k=0, k=1, …, k=21.)

Fortunately, we can get a formula for the best decision rule for any loss. First, re-write the Bayes risk as

R_{\mathrm{Bayes}}(\mathrm{Dec})=\sum_{k} \left( \int_{w=0}^{1}p(w)p(k\vert n,w)L(w,\mathrm{Dec}(k))dw \right).

This is a sum over k where each term only depends on a single value \mathrm{Dec}(k). So, we just need to make the best decision for each individual value of k separately. This leads to the Bayes-optimal decision rule of

\mathrm{Dec}_{\text{Bayes}}(k)=\arg\min_{d}\int_{w=0}^{1}p(w)p(k\vert w,n)L(w,d)dw.
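In code, this arg-min is a direct numerical integration over w for each observed k. The sketch below uses the uniform prior p(w)=1 and the same stand-in loss as before (an assumption, not the article's exact L_{A}): 1-w for predicting heads, w for tails.

```python
from math import comb

n = 21

def pmf(k, w):
    # p(k | w, n): binomial probability of k heads in n flips
    return comb(n, k) * w**k * (1 - w) ** (n - k)

def loss(w, d):
    # stand-in for L_A (an assumption)
    return 1 - w if d == "heads" else w

def cost(k, d, grid=1000):
    # integral over w of p(w) p(k | w, n) L(w, d), with uniform prior
    # p(w) = 1, approximated by a midpoint rule
    return sum(pmf(k, (i + 0.5) / grid) * loss((i + 0.5) / grid, d)
               for i in range(grid)) / grid

def dec_bayes(k):
    # Bayes-optimal decision: whichever choice has smaller integrated cost
    return min(["heads", "tails"], key=lambda d: cost(k, d))
```

With this stand-in loss, the rule works out to predicting heads exactly when the posterior mean (k+1)/(n+2) exceeds 1/2.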

With a uniform prior p(w)=1, here’s the optimal Bayesian decision rule with loss L_{A}:


And here it is for loss L_B:


Look at that! Just mechanically plugging the loss function into the Bayes-optimal decision rule naturally gives us the behavior we expected– for L_{B}, the rule is very hesitant to predict tails, since the loss is so high if you’re wrong. (Again, these happen to fit in the parameterized family \mathrm{Dec}_{m} defined above, but we didn’t use this assumption in deriving the rules.)

The nice thing about the Bayesian approach is that it’s so systematic. No creativity or cleverness is required. If you specify the data generating process (p(k\vert w,n)) the loss function (L(w,d)) and the prior distribution (p(w)) then the optimal Bayesian decision rule is determined.

There are some disadvantages as well:

  • Firstly, you need to make up the prior, and if you do a terrible job, you’ll get a poor decision rule. If you have little prior knowledge, this can feel incredibly arbitrary. (Imagine you’re trying to estimate Big G.) Different people can have different priors, and then get different results.
  • Actually computing the decision rule requires doing an integral over w, which can be tricky in practice.
  • Even if your prior is good, the decision rule is only optimal when averaged over the prior. Suppose, for every day for the next 10,000 years, a random coin is created with w drawn from p(w). Then, no decision rule will incur less loss than \mathrm{Dec}_{\text{Bayes}}. However, on any particular day, some other decision rule could certainly be better.

So, if you have little idea of your prior, and/or you’re only making a single decision, you might not find much comfort in the Bayesian guarantee.

Some argue that these aren’t really disadvantages. Prediction is impossible without some assumptions, and priors are upfront and explicit. And no method can be optimal for every single day. If you just can’t handle that the risk isn’t optimal for each individual trial, then… maybe go for a walk or something?

Option 2 : Be pessimistic

Frequentist statistics (Why “frequentist”? Don’t think about it!) often takes a different path. Instead of defining a prior over w, let’s take a worst-case view. Let’s define the worst-case risk as

R_{\mathrm{Worst}}(\mathrm{Dec})=\max_{w}R(w,\mathrm{Dec}).
Then, we’d like to choose a decision rule that minimizes the worst-case risk. We call this a “minimax” decision rule, since we minimize the max (worst-case) risk.

Let’s visualize this with our running example and L_{A}:


As you can see, for each individual decision rule, we search over the space of parameters w to find the worst case. We can visualize the risk with L_{B} as:


What’s the corresponding minimax decision rule? This is a little tricky to deal with– to see why, let’s expand the worst-case risk a bit more:

R_{\mathrm{Worst}}(\mathrm{Dec})=\max_{w}\sum_{k}p(k\vert n,w)L(w,\mathrm{Dec}(k)).

Unfortunately, we can’t interchange the max and the sum, like we did with the integral and the sum for Bayesian decision rules. This makes it more difficult to write down a closed-form solution. At least in this case, we can still find the best decision rule by searching over our simple rules \mathrm{Dec}_m. But be very mindful that this doesn’t work in general!
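For this toy problem the search is a small brute force: evaluate the worst-case risk of every \mathrm{Dec}_m over a grid of w values and keep the smallest. (Same stand-in loss as before, an assumption: 1-w for predicting heads, w for tails.)

```python
from math import comb

n = 21

def pmf(k, w):
    # p(k | w, n): binomial probability of k heads in n flips
    return comb(n, k) * w**k * (1 - w) ** (n - k)

def loss(w, d):
    # stand-in loss (an assumption)
    return 1 - w if d == "heads" else w

def risk(w, m):
    # R(w, Dec_m), where Dec_m predicts heads iff k >= m
    return sum(pmf(k, w) * loss(w, "heads" if k >= m else "tails")
               for k in range(n + 1))

ws = [i / 1000 for i in range(1001)]  # grid over w in [0, 1]
# worst-case risk of each rule, maximized over the grid
worst = {m: max(risk(w, m) for w in ws) for m in range(n + 2)}
minimax_m = min(worst, key=worst.get)
```

With this symmetric stand-in loss, the symmetric rule m=11 comes out minimax, with worst-case risk 1/2 attained at w=1/2; the asymmetric L_{B} would shift the answer, as the figures below show.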

For L_{A} we end up with the same decision rule as when minimizing Bayesian risk:


For L_{B}, meanwhile, we get something slightly different:


This is even more conservative than the Bayesian decision rule. \mathrm{Dec}_{B-\mathrm{Bayes}}(2)=\mathrm{tails}, while \mathrm{Dec}_{B-\mathrm{minimax}}(2)=\mathrm{heads}. That is, the Bayesian rule still predicts tails after observing 2 heads, while the minimax rule already predicts heads. This makes sense intuitively: the minimax decision rule proceeds as if the “worst” w (a small number) were fixed, whereas the Bayesian decision rule less pessimistically averages over all w.

Which decision rule will work better? Well, if w happens to be near the worst-case value, the minimax rule will be better. If you repeat the whole experiment many times with w drawn from the prior, the Bayesian decision rule will be.

If you do the experiment at some w far from the worst-case value, or you repeat the experiment many times with w drawn from a distribution different from your prior, then you have no guarantees.

Neither approach is “better” than the other; they just provide different guarantees. You need to choose what guarantee you want. (You can kind of think of this as a “meta” loss.)

So what about all those formulas, then?

For real problems, the data generating process is usually much more complex than a binomial. The “decision” is usually more complex than predicting a coinflip– the most common decision is making a guess for the value of w. Even calculating R(w,\mathrm{Dec}) for fixed w and \mathrm{Dec} is often computationally hard, since you need to integrate over all possible observations. In general, finding exact Bayes or minimax optimal decision rules is a huge computational challenge, and at least some degree of approximation is required. That’s the game, and that’s why statistics is hard. Still, even for complex situations the rules are the same– you win by finding a decision rule with low risk.