More confidence games

You drew 40 random cells from a sample and found that a new drug affected 16 of them. An online calculator told you:

“With 90% confidence, the true fraction is between 26.9% and 54.2%.”

But what does this really mean? We’ve talked about confidence sets before—read that post first if you find this one too difficult. Now let’s talk about intervals.

Confidence intervals

Years from now, all Generation β does is sit around meditating on probability theory and reading Ars Conjectandi. You work at a carnival where one day your boss says, “Our new ride is now so popular that we only have capacity for 10% of guests. I’ve got some bent coins of different colors, and I want you to create a guessing game to ration access to the ride.”

Here’s how the game is supposed to work:

  1. The guest will pick one coin, flip it, and announce the outcome.
  2. Based on that outcome, you guess some set of colors.
  3. The true color of the coin is revealed. If it’s not in the set of colors, the guest can go on the ride.

It’s essential that no matter what the guests do, at most 10% of them can win the game. But otherwise, you’d like to guess as few colors as possible to better impress the players.

You’re given five bent coins, which your boss has CT-scanned and run through exhaustive simulations to find the true probability that each will come up heads.

Coin Prob. tails Prob. heads
red .9 .1
green .7 .3
blue .5 .5
yellow .3 .7
white .1 .9

Your first idea for a game is the obvious one: Flip it, and try to guess the color based on the outcome of heads or tails. While you could do that, it would be extremely boring.

Then, you have another idea. Why not flip the coin twice? Take the red coin. It’s easy to compute the probability of getting different numbers of heads:

  • 0 heads: (0.9)² = 0.81. (The probability of flipping tails twice in a row.)
  • 1 head: 0.1 × 0.9 + 0.9 × 0.1 = 0.18. (The probability of heads then tails, plus the probability of tails then heads.)
  • 2 heads: (0.1)² = 0.01. (The probability of flipping heads twice in a row.)

Continuing this way, you make a table of the probability of getting a total number of heads after two coin flips for each of the coins:

Coin     0 heads  1 head  2 heads
red      .81      .18     .01
green    .49      .42     .09
blue     .25      .50     .25
yellow   .09      .42     .49
white    .01      .18     .81

You could make a game based on two coin flips, but why not spice things up even more by flipping the coin, say, 5 times? You do a little bit of research, and you discover that the probability of getting a total of tot-heads heads after doing num-flips flips of a coin with a bias of prob is called a Binomial, namely Binomial(tot-heads | num-flips, prob). For example, the probabilities we calculated above are Binomial(0 | 2, 0.1)=0.81, Binomial(1 | 2, 0.1)=0.18, and Binomial(2 | 2, 0.1)=0.01. If num-flips is larger than 2 the math gets more complicated, but who cares? You find some code that can compute Binomial probabilities, and you use it to create the following table of the probability of getting each total number of heads after 5 coin flips:

Coin     0 heads  1 head   2 heads  3 heads  4 heads  5 heads
red      .59049   .32805   .07290   .00810   .00045   .00001
green    .16807   .36015   .30870   .13230   .02835   .00243
blue     .03125   .15625   .31250   .31250   .15625   .03125
yellow   .00243   .02835   .13230   .30870   .36015   .16807
white    .00001   .00045   .00810   .07290   .32805   .59049
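The table above can be reproduced with a few lines of code. This is a minimal sketch using only Python's standard library; the names `binomial`, `outcome_table`, and `COINS` are illustrative choices, not anything from the carnival's actual tooling.

```python
from math import comb

# Probability of heads for each coin, copied from the first table.
COINS = {"red": 0.1, "green": 0.3, "blue": 0.5, "yellow": 0.7, "white": 0.9}

def binomial(tot_heads, num_flips, prob):
    """Probability of exactly tot_heads heads in num_flips independent flips."""
    ways = comb(num_flips, tot_heads)  # number of orderings with that many heads
    return ways * prob**tot_heads * (1 - prob)**(num_flips - tot_heads)

def outcome_table(num_flips, coins=COINS):
    """For each coin, the probabilities of 0, 1, ..., num_flips heads."""
    return {
        color: [binomial(k, num_flips, prob) for k in range(num_flips + 1)]
        for color, prob in coins.items()
    }

# Print one row per coin, rounded to five decimals like the table.
for color, row in outcome_table(5).items():
    print(f"{color:<7}" + " ".join(f"{x:.5f}" for x in row))
```

Calling `outcome_table(2)` gives the two-flip table in the same way.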

This seems to make sense: The most likely outcome for the red coin is all tails, since the red coin is rarely heads. The outcomes for the blue coin are symmetric and concentrated around 2 or 3 heads, while the most likely outcome for the white coin is all heads.

Now, what colors should you guess for each outcome? Again, you need to make sure that, no matter what color the guest chooses, you will include that color with at least 90% probability. This is equivalent to covering at least .9 of the probability from each row. You decide to go about this in a greedy way. For each row, you add entries from largest to smallest until you get a total of at least 0.9. If you do that, you get this result:

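Coin     0 heads   1 head    2 heads   3 heads   4 heads   5 heads
red      [.59049]  [.32805]   .07290    .00810    .00045    .00001
green    [.16807]  [.36015]  [.30870]  [.13230]   .02835    .00243
blue      .03125   [.15625]  [.31250]  [.31250]  [.15625]   .03125
yellow    .00243    .02835   [.13230]  [.30870]  [.36015]  [.16807]
white     .00001    .00045    .00810    .07290   [.32805]  [.59049]

(Entries in brackets are the ones the greedy strategy includes.)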

This corresponds to the following confidence sets:

Outcome What you guess
0 {red, green}
1 {red, green, blue}
2 {green, blue, yellow}
3 {green, blue, yellow}
4 {blue, yellow, white}
5 {yellow, white}
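The greedy strategy and the resulting confidence sets can be sketched in code. This is an illustrative sketch under the assumptions stated in the text (five coins, five flips, a 0.9 target); the function names are made up for this example.

```python
from math import comb

COINS = {"red": 0.1, "green": 0.3, "blue": 0.5, "yellow": 0.7, "white": 0.9}

def binomial(k, n, p):
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def greedy_cover(row, level=0.9):
    """Add outcomes from most to least probable until the total reaches level."""
    chosen, total = set(), 0.0
    for k in sorted(range(len(row)), key=lambda k: row[k], reverse=True):
        if total >= level:
            break
        chosen.add(k)
        total += row[k]
    return chosen

def confidence_sets(coins, num_flips, level=0.9):
    """For each outcome, the colors whose greedy cover contains that outcome."""
    covers = {
        color: greedy_cover([binomial(k, num_flips, p) for k in range(num_flips + 1)], level)
        for color, p in coins.items()
    }
    return {k: [c for c in coins if k in covers[c]] for k in range(num_flips + 1)}

for outcome, colors in confidence_sets(COINS, 5).items():
    print(outcome, colors)
```

Each coin's cover sums to at least 0.9 by construction, which is exactly the guarantee that no more than 10% of guests win, whichever coin they pick.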

Remember what we stressed last time: When we get 4 heads and say we are “90% confident” the color is blue, yellow, or white, we don’t mean “90% probability”; we just mean that your guessing procedure will work for at least 90% of guests. After seeing 4 heads, the probability of any given color is (according to the worldview of confidence intervals) meaningless, because the color is a fixed quantity. And even if you’re willing to talk about probabilities in such situations, the probability could be much higher or lower than 90%.

It occurs to you that you can visualize this as a heatmap, with lighter colors representing higher probabilities. The entries in the following figure are laid out in the same way as the above table. Remember that red has a .1 probability of being heads so it is in the first row, green has a .3 probability so it’s in the second row, etc.

You can visualize the confidence sets by drawing an outline around the coin/outcome pairs that are included in your strategy.

Things go well for a while, but then your boss comes around again and says “People want more of a challenge!” You’re given 19 coins with each of the probabilities .05, .10, .15, …, .95 and told to increase the number of coin flips from 5 to 40.

At this point, it would be tedious to look at tables of numbers, but you can still visualize things:

You can use the same greedy strategy of including elements from each row until you get a sum of at least 0.9. If you do that, this is what you end up covering:

Notice: For any given outcome, the coins that you include are always next to each other. This happens just because for each coin, there’s a single mode of probability around a given outcome, and the location of this mode changes smoothly as the bias of the coin changes. This is why we can talk about confidence intervals rather than confidence sets: The math happens to work out in such a way that the included coins are always next to each other.
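One way to check the "next to each other" observation for the 19-coin, 40-flip game is to list the included coins for every outcome and test whether their positions form an unbroken run. This is a sketch with made-up helper names; it reports what it finds rather than assuming the answer.

```python
from math import comb

NUM_FLIPS = 40
BIASES = [round(0.05 * i, 2) for i in range(1, 20)]  # .05, .10, ..., .95

def binomial(k, n, p):
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def greedy_cover(row, level=0.9):
    """Add outcomes from most to least probable until the total reaches level."""
    chosen, total = set(), 0.0
    for k in sorted(range(len(row)), key=lambda k: row[k], reverse=True):
        if total >= level:
            break
        chosen.add(k)
        total += row[k]
    return chosen

# One greedy cover (a set of outcomes) per coin, in order of increasing bias.
COVERS = [
    greedy_cover([binomial(k, NUM_FLIPS, p) for k in range(NUM_FLIPS + 1)])
    for p in BIASES
]

def included_coins(outcome):
    """Indices into BIASES of the coins guessed when this outcome is seen."""
    return [i for i, cover in enumerate(COVERS) if outcome in cover]

def is_contiguous(indices):
    """True if the indices form an unbroken run (or there are fewer than two)."""
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))

for outcome in range(NUM_FLIPS + 1):
    idx = included_coins(outcome)
    print(outcome, [BIASES[i] for i in idx], is_contiguous(idx))
```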

Finally, your boss suggests one last change. The game should work this way:

  1. Each guest is given a soft-metal coin, which they can bend into whatever shape they want.
  2. That coin is flipped 40 times, and the outcome is announced.
  3. You need to guess some interval that hopefully contains the true bias of the coin.
  4. The coin is CT-scanned, and the carnival’s compute cluster finds the true bias. If it’s not in the interval you guessed, the guest can go on the ride.

Thinking about how to address this game, it occurs to you that you can make figures in the same way with any number of coins, and if you use a fine enough grid, you will cover all possibilities. The following figure shows what you get if you use the same process with 1001 coins with biases 0.000, 0.001, 0.002, …, 1.000.

Now, remember where we started: We tested 40 cells and found that 16 of them had changed, and an online calculator told us that with 90% confidence the true fraction was between 26.9% and 54.2%.

To understand where these numbers come from, just take this figure and put a vertical line at # heads = 16:

What’s included is all the coins with biases between .269 and .542. That’s why the confidence interval for 16 out of 40 is 26.9% to 54.2%.
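The "vertical line" step can be sketched in code: run the greedy cover for each of the 1001 grid biases and keep the biases whose cover contains 16 heads. This is an illustrative sketch, not the online calculator's method; the function names are assumptions, and the exact endpoints depend on the grid and on how ties are broken, so they may not match the calculator digit for digit.

```python
from math import comb

def binomial(k, n, p):
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def greedy_cover(row, level=0.9):
    """Add outcomes from most to least probable until the total reaches level."""
    chosen, total = set(), 0.0
    for k in sorted(range(len(row)), key=lambda k: row[k], reverse=True):
        if total >= level:
            break
        chosen.add(k)
        total += row[k]
    return chosen

def confidence_interval(tot_heads, num_flips, level=0.9, grid_size=1001):
    """Smallest and largest grid bias whose greedy cover contains tot_heads."""
    biases = [i / (grid_size - 1) for i in range(grid_size)]
    included = [
        p for p in biases
        if tot_heads in greedy_cover(
            [binomial(k, num_flips, p) for k in range(num_flips + 1)], level
        )
    ]
    return min(included), max(included)

low, high = confidence_interval(16, 40)
print(f"{low:.3f} to {high:.3f}")
```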

The replication game

The replication crisis in psychology started with a bang with Daryl Bem’s 2011 paper Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect. In this paper, Bem performed experiments that appeared to prove the existence of extrasensory perception (ESP). For example, subjects could guess (better than chance) where erotic images would appear on a screen. There were thousands of subjects. The paper used simple, standard statistical methods, and found significance for eight of nine experiments. Bem is a respected researcher, who no one suspects of anything underhanded.

The problem, of course, is that ESP isn’t real.

The psychology community was shaken. A smart person had carefully followed standard research practices and proved something impossible. This was ominous. Bem’s paper was only questioned because the result was impossible. How many other false results were lurking undetected in the literature?

Today, we’d say he had wandered the garden of forking paths: If you look at enough datasets from enough angles, you’ll always find significance eventually. It’s easy to do this even if you have the best intentions.

Many psychologists saw the Bem affair as a crisis and quickly moved to assess the damage. In 2014, an entire issue of Social Psychology was dedicated to registered replications. Each paper was both a replication (a repeat of a previous experiment) and pre-registered (the research plan and statistical analysis were peer-reviewed before the experiment was done).

So how well did existing results replicate? In their introduction to the issue, Nosek and Lakens are rather coy:

This special issue contains several replications of textbook studies, sometimes with surprising results.

Awesome, but which ones were surprising, how surprising were they, and in what way?

I wanted a number, a summary, something I could use as a heuristic. So, I read all the papers and rated every experiment in every paper as either a full success (★), a success with a much smaller effect size (☆), or a failure (✗). I used the summary from the authors if they gave one, which they often didn’t, probably because people are squeamish about yelling "my famous colleague’s Science paper is bogus" too loudly. Then, for each effect, I averaged the scores for all the experiments using ✗=0, ☆=2, and ★=4.
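The averaging step is simple arithmetic, sketched here. The ratings in the example are hypothetical, chosen so as not to spoil the game, and the function name is made up.

```python
# Numeric value of each rating symbol, as described above.
SCORES = {"✗": 0, "☆": 2, "★": 4}

def effect_score(ratings):
    """Average score on the 0-4 scale over all experiments testing one effect."""
    return sum(SCORES[r] for r in ratings) / len(ratings)

# Hypothetical effect tested by two experiments: one full success, one failure.
print(effect_score(["★", "✗"]))  # 2.0
```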

So, let’s play a game.

I’ll give a short description of the effects that were tested. You can predict how well each replicated by choosing an integer between 0 (failure) and 4 (full success). You can score yourself manually, or there’s a form you can use below.

(Or, if you’re the impatient type, feel free to jump to the summary.)

The effects

The first few effects are textbook results. (📖)

(1) The primacy of warmth is the idea that when forming impressions of others, the "warmth" of someone’s personality is more important than competence. This claim goes back to Solomon Asch in 1946 where people were given lists of various attributes for people like "intelligent, skillful, industrious, warm, determined, practical, cautious" and asked which was most important.

(2) Deviation-rejection is the idea that someone who stubbornly holds an opinion contrary to a group consensus will ultimately be rejected from that group. Schachter (1951) ran focus groups to discuss what to do with a juvenile delinquent. Most people were lenient, but a confederate in the group always supported a harsh punishment. This confederate was talked to less over time, and chosen less for future groups.

(3) The Romeo and Juliet effect is the idea that parental opposition to a relationship will increase the love and commitment in that relationship. This was first suggested by Driscoll et al. (1972), who found that parental interference was correlated with love and trust in relationships and that, over time, increases in interference were correlated with an increase in love.

(4) Single exposure musical conditioning was investigated by Gorn (1982) by randomly showing people either a blue or beige pen, while playing either music they liked or disliked. Later, people could choose a pen of either color to keep. This experiment found that 79% of people who heard music they liked chose the pen shown on screen, while only 30% of people who heard music they disliked did the same.

The next few effects concerned papers where the literature is somewhat in conflict or where there are moderators with unclear impact. (⚔️)

(5) Stereotype priming was investigated by Blair and Banaji (1996) who found that briefly flashing a stereotypically "male" or "female" word in front of someone changed how long it took to discriminate between male or female first names. Additionally, Banaji and Hardin (1996) found an effect even if the discrimination was unrelated to the primes: "Male" words made participants faster to recognize male names even if the task was to discriminate city names vs. first names.

(6) Stereotype susceptibility (or stereotype threat) is the idea that cognitive performance is influenced by stereotypes. Shih et al. (1999) took Asian-American women and had them take a math test, while being primed to either think about being women or about being Asian. The Asian-primed group got 54% of the questions right while the female-primed group got only 43% right. However, these results came from a small sample, and results in real-world applications have been mixed.

(7) Sex differences in distress from infidelity. A common idea in evolutionary psychology is that males are more upset by sexual infidelity than females. Shackelford et al. (2004) looked at a population with a mean age of 20 years and asked if people would be more distressed if their partner had passionate sex with someone else or formed a deep emotional attachment. 76% of young men chose sex, as opposed to 32% of women. In an older population with a mean age of 67 years, 68% of men chose sex, as opposed to 49% of women.

(8) Psychological distance and moral judgment is the idea that people make more intense moral judgments (positive or negative) for more psychologically distant acts. Eyal et al. (2008) had participants make judgments about people who committed various acts (incest, eating a dead pet, donating to charity), varying whether they happened in the near vs. distant future. People judged those in the distant future around 1.5 points more harshly on a -5 to +5 scale. However, Gong and Medin (2012) performed a similar experiment and came to the opposite conclusion.

Finally, there were recent studies that had attracted a lot of attention. (🔥)

(9) Does cleanliness influence moral judgments? Schnall et al. (2008) found that people make less severe moral judgments when they feel clean. They primed people with words and then asked people to judge a set of moral dilemmas, e.g. trolley problems or faking information on a resume. They found that people primed with "clean" words judged people around 1 point less harshly than those primed with neutral words, on a 7-point scale. In a second experiment, people saw a clip from Trainspotting with a man using an unclean toilet and were asked to wash their hands (or not) before being given the same moral dilemmas. Those who washed their hands judged people around 0.7 points less harshly.

(10) Does physical warmth promote interpersonal warmth? Williams and Bargh (2008) gave participants either a cold pack or a heat pack to evaluate, after which they could choose a gift either for themselves or for a friend. Those given the heat pack were around 3.5x more likely to choose a gift for a friend.

(11) Moral licensing is the idea that someone who does something virtuous will later feel justified to do something morally questionable. Sachdeva et al. (2009) had participants write a short story about themselves using words that were either positive words ("caring", "generous") or negative words ("disloyal", "greedy"). Afterward, they asked how much they would be willing to donate to charity. Those who used the positive words donated a mean of $1.07, while those who used negative words donated $5.30.

(12) Can superstition improve performance? Damisch et al. (2010) gave subjects a golf ball and asked them to attempt a 100cm putt. Subjects told the ball had been lucky so far were able to make 65% of putts, as opposed to 48% of controls.

(13) Is moral behavior connected to brightness? Banerjee et al. (2012) asked participants to describe something they did in the past that was either ethical or unethical, then asked them how bright their room was. Those in the ethical group described their room as around 0.6 points brighter on a 7-point scale. They also judged various light-emitting products (lamps, candles, flashlights) as around 2 points more desirable on a 7-point scale.


To play the game, fill out this Google form quiz thing.

Or, you can score yourself. In the following table, record your absolute error for each effect. (If you guessed 1 and the correct answer was 4, you get 3 points.) Then add all these errors up.

The best possible score is 0, while the worst is 45.

Here are the results for each effect, ranging from ○○○○ (failure) to ●●●● (full replication).

Effect Type Result
Primacy of warmth 📖 ○○○○
Deviation-rejection 📖 ●●○○
Romeo and Juliet effect 📖 ○○○○
Single exposure musical conditioning 📖 ●●●●
Stereotype priming ⚔️ ●○○○
Stereotype susceptibility ⚔️ ●○○○
Sex differences in distress from infidelity ⚔️ ●○○○
Psychological distance and moral judgment ⚔️ ●●○○
Cleanliness and moral judgments 🔥 ○○○○
Physical warmth and interpersonal warmth 🔥 ○○○○
Moral licensing 🔥 ○○○○
Superstition and performance 🔥 ○○○○
Moral behavior and brightness 🔥 ○○○○
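If you'd rather score yourself with code than by hand, here is a small sketch. The results are copied from the table above (one point per filled circle); the guesses are a hypothetical player who answers 1 for everything, and the function name is made up.

```python
# Replication scores from the table above, 0 (failure) to 4 (full replication).
RESULTS = {
    "Primacy of warmth": 0,
    "Deviation-rejection": 2,
    "Romeo and Juliet effect": 0,
    "Single exposure musical conditioning": 4,
    "Stereotype priming": 1,
    "Stereotype susceptibility": 1,
    "Sex differences in distress from infidelity": 1,
    "Psychological distance and moral judgment": 2,
    "Cleanliness and moral judgments": 0,
    "Physical warmth and interpersonal warmth": 0,
    "Moral licensing": 0,
    "Superstition and performance": 0,
    "Moral behavior and brightness": 0,
}

def game_score(guesses, results=RESULTS):
    """Total absolute error over all effects; lower is better, 0 is perfect."""
    return sum(abs(guesses[name] - results[name]) for name in results)

# Hypothetical player who guesses 1 for every effect.
guesses = {name: 1 for name in RESULTS}
print(game_score(guesses))

# Mean replication score across the 13 effects.
print(sum(RESULTS.values()) / len(RESULTS))
```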

What to make of this?

Obviously, the replication rate is strikingly low. Only one effect replicated at the original effect size, and even this has some caveats (below). The average replication score was 0.8 / 4. And one effect (Romeo and Juliet) deserves a negative score: It didn’t just fail to replicate, there was a large effect in the opposite direction.

Also striking is that all of the new hot effects (🔥) failed—not one experiment in one paper replicated even at a smaller effect size. Strangely, all the conflicted papers (⚔️) replicated at least a little bit.

I don’t think things are nearly as bad as this table suggests. For one thing, if you were a psychologist, what would make you perform a registered replication? Personally, I’d be motivated by something like this:

someone is wrong on the internet

I’d do it because I thought some famous effect wasn’t real. So this is a highly non-random sample. The correct conclusion is less "psychology is all wrong" and more "when psychologists think an effect is bogus, they’re usually right".

Second, these results are mostly social psychology. You, dear reader, are probably not a social psychologist, but rather someone interested in the broad art of predicting stuff, which is the domain of cognitive psychology. (At least, unless you’re predicting how people interact in groups.) The Open Science Collaboration in 2015 tried to reproduce a bunch of recent papers in social and cognitive psychology. They found that around 28% of social effects subjectively "replicated", as opposed to around 53% of cognitive effects. They also found that replication effect sizes were much closer to the original effect sizes for cognitive psychology (a ratio of around 0.56 as compared to 0.31 in social psychology).

The replications

Here are more details about what happened in each of the replications, and justifications of the ratings I gave. If you’re the author of one of these replications and I got anything wrong, I’d love to hear from you.

The primacy of warmth

This claim goes back to Solomon Asch in 1946. The idea is that, when forming impressions of people, warmth-related judgements are more important than competence. Nauts et al. replicated Asch’s experiments. They showed different people various lists of traits, such as the following:

  • (Condition 1) Intelligent, skillful, industrious, warm, determined, practical, cautious
  • (Condition 2) Intelligent, skillful, industrious, cold, determined, practical, cautious
  • (Condition 3) Obedient, weak, shallow, warm, unambitious, vain
  • (Condition 4) Vain, shrewd, unscrupulous, warm, shallow, envious
  • (Condition 5) Intelligent, skillful, sincere, cold, conscientious, helpful, modest

The fraction of people that chose "warm" or "cold" as the most important trait was as follows:

Condition  Most chosen trait  % choosing it  % choosing warm/cold
1          intelligent        55.3%          19.5%
2          intelligent        36.2%          30.0%
3          obedient           21.7%          7.0%
4          vain               44.0%          6.6%
5          intelligent        53.5%          7.8%

According to Asch’s theory, people should choose "warm" and "cold" as the most important traits in conditions 1 and 2, but not in the others. It is true that warm/cold were considered more important in these conditions, but they were never the most common choice as the most important trait. Nauts et al. call this a clear failure of this particular experiment to replicate, though they emphasize that other research makes clear that warmth is indeed primary in many circumstances.

Summary: ✗ Total replication failure.


Deviation-rejection

Wesselmann et al. investigated "deviation-rejection", or the claim that someone who holds an opinion contrary to a group consensus will ultimately be rejected from that group. Following Schachter (1951), they created groups of 10 people, consisting of 7 experimental subjects and 3 confederates. Everyone was given a case study of a juvenile delinquent named Johnny Rocco and asked how Johnny should be treated, followed by a discussion. Most subjects were lenient. The mode confederate followed the current group consensus. The slider confederate first supported harsh treatment, then gradually shifted towards leniency. The deviate confederate always said to punish Johnny and never changed.

They looked at three claims made by Schachter. First, did people communicate with the deviate confederate less over time? They did seem to, though the data was noisy.

Second, they looked at whether people would assign the confederates to leadership roles in future committees. Contra Schachter, they found no effect.

Third, they had people rank each person for how much they’d like to have them in future groups. On a scale of 1-3, the slider got a score of 1.74, the mode a score of 1.91, and the deviate a score of 2.34. (So people dislike nonconformists, but like people who are willing to change their mind?) This replicates Schachter again, albeit with a significantly smaller effect size.

Summary: ☆ Replication mostly succeeded, but with a smaller effect size

The Romeo and Juliet effect

The Romeo and Juliet effect is the claim that parental opposition to a relationship will increase the love and commitment in that relationship. This was first suggested by Driscoll et al. (1972), who found that parental interference was correlated with love and trust, and also that over time, increases in parental interference were correlated with increases in love.

Sinclair et al. replicated these experiments. They found people online who were in relationships and asked them about their relationship, e.g. how much they loved their partner, how much they trusted their partner. They also asked how much friends or parents disapprove of their relationship. They contacted these people again around 4 months later. They then looked at correlations.

Their results were the opposite of what Driscoll et al. found. Greater approval from friends and parents was associated with higher relationship quality, not lower. And increased parental disapproval was correlated with decreased love.

Summary: ✗ Calling this just a failure is generous. The original effect not only failed to replicate, but the opposite effect held (with a large effect size and high statistical significance).

Single-exposure musical conditioning

Gorn (1982) randomly showed people either a blue or beige pen, while playing either music they liked or disliked. Then, later, they could choose one pen of either color to keep. This experiment found that 79% of people who heard music they liked chose the pen shown on screen, compared with only 30% of people who heard music they disliked.

Vermeulen et al. set out to reproduce this result. In a first experiment, they used music similar to the original music from 1982: Summer Nights from Grease as the "liked" music, and Aaj Kaun Gali Gayo Shyam by Parveen Sultana as the "disliked" music. They confirmed that people really did like the Grease music more than the other song (mean ratings of 3.72 vs. 2.11 on a scale of 1-7). However, this first experiment found no real effect on pen choice.

After this, they repeated the same study, but with an actor pretending to be a researcher from another department and a post-experimental questionnaire. This again found no real effect.

Finally, maybe the problem was just using old music that college students are unfamiliar with? To test this, they got two different renditions of Rihanna’s We Found Love by cover artists, and selected one (that people liked) as the "liked" music and one (that people didn’t like) as the "disliked" music. People liked the good rendition much more than the other (mean score of 5.60 vs. 2.48).

For this third experiment, they ran out of students in the class, and so had fewer subjects than planned. Still, they found that 57% of people who heard the "liked" music chose the pen on screen, as opposed to only 23% of people who heard the "disliked" music. Despite the smaller sample, this was still highly significant, with p=.003.

Still, I think the third experiment mostly rescues the effect: Students didn’t really like the music from Grease in the first two experiments, they just disliked it slightly less (3.72 is a low score on a scale of 1-7!). The third experiment is the only one where there’s a big difference in how much people like the music, and there’s a big effect there. It’s unfortunate they ran out of subjects, though!

Summary: ★? The authors call this a "somewhat unreliable successful replication".

Stereotype priming

Blair and Banaji (1996) claimed that if you briefly flashed a stereotypically "male" or "female" word in front of someone, that would change how long it would take people to discriminate between male or female first names. Additionally, Banaji and Hardin (1996) found this could have an effect even if the discrimination was unrelated to the primes (e.g. discriminating cities vs. names).

Müller and Rothermund set out to replicate these effects. They had people come into the lab and fixate on a screen. Then, they’d briefly be shown either stereotypically "male" or "female" priming words. Example male primes are "computer", "to fight", and "aggressive", while example female primes are "ballet", "to put on make-up", and "gossipy". These primes were shown for only 200 ms.

In a first experiment, the prime was followed by either a male name ("Achim", this happened in Germany) or a female name ("Annette"), which subjects needed to classify as quickly as possible. Here were the mean times (and standard deviations) in ms:

Target Gender Male Prime Female Prime
male 554 (80) 566 (80)
female 562 (83) 549 (80)

There was a significant effect—albeit a small one. However, Blair and Banaji also found a small effect, although around 2x or 3x larger than this one.

A second experiment was the same, except now they would see a (male or female) first name 50% of the time, and a city name (e.g. "Aachen") 50% of the time. Now, subjects needed to distinguish a first name from a city name.

Target Gender Male Prime Female Prime
male 605 (91) 605 (90)
female 570 (86) 567 (85)

In this analysis, they simply ignore all trials where a city was shown, so this table is showing how long it takes to recognize male/female names as names. For whatever reason, people found it to be harder to recognize male names, but priming had no effect on this. In contrast, Banaji and Hardin had found that changing the prime would have an effect of around 14 ms.

Summary: ☆ / ✗ Half the replication failed, the other half succeeded with a smaller effect size.

Stereotype susceptibility

Shih et al. (1999) took Asian-American women and had them take a math test. Before the math test, some were primed to think about being women by being given questions about coed or single-sex living arrangements. Others were primed to think about being Asian by answering questions about family and ethnicity. They found that the Asian-primed group got 54% right while the female-primed group got 43% right. (The control group got 49%.)

Gibson et al. replicated this at six universities in the Southeastern US. With a sample of 156 subjects (as opposed to only 16) they found that the Asian-primed group got 59% right, while the female-primed group got 53% right. This was smaller and nonsignificant (p=.08). They then excluded participants who weren’t aware of stereotypes regarding math and Asians/women. Among the remaining 127 subjects, the Asian-primed group got 63% right, while the female-primed group got 51% right, and the effect was significant (p=.02).

But then, in a second article, Moon and Roeder tried to replicate exactly the same result using the same experimental protocol. They found that the Asian-primed group got 46% correct, while the female-primed group got 43% correct. This difference was nonsignificant (p=.44). However, in this same experiment, the control group got 50% correct.

Among only those aware of the stereotype, the Asian-primed group got 47% correct, while the female-primed group got 43%. Both of these results were nonsignificant (p=.44, and p=.28, respectively). Here again, the control group got 50%. The higher performance in the control group is inconsistent with the theory of priming, so this is a conclusive failure.

Summary: ☆? / ✗ The first replication basically half-succeeded, while the second failed.

Sex differences in distress from infidelity

A common idea in evolutionary psychology is that males are more upset by sexual infidelity than females. Shackelford et al. (2004) looked at datasets from two populations, one with a mean age of 20 years and one with a mean age of 67. They found that in both populations males were more distressed by infidelity than females, though the difference was smaller in the older population.

IJzerman et al. replicated these experiments. In the younger population, they successfully replicated the result. In the older population, they did not replicate the result.

Summary: ☆ / ✗ One successful replication with a smaller effect, and one failed replication.

Psychological distance and moral judgment

Eyal et al. (2008) claimed that people made more intense moral judgments for acts that were psychologically distant. Gong and Medin (2012) came to the opposite conclusion.

Žeželj and Jokić replicated this experiment. They had subjects make judgments about the actions of people in hypothetical scenarios. In a first experiment, they described incest or eating a dead pet, but varied whether they happened in the near future or in the distant future. Contra Eyal et al., distant-future transgressions were judged similarly to near-future transgressions. (Near future was judged 0.12 points more harshly on a -5 to +5 scale, so the effect was actually in the wrong direction.)

In a second experiment, they instead varied whether subjects were asked to think in the first person about a specific person they knew performing the act, or to focus on their thoughts and think about it from a third-person perspective. All scenarios showed that people were more harsh when thinking about things from a distance. The difference was around 0.44 averaged over scenarios. This effect was significant and similar in magnitude to what Eyal et al.’s research predicted.

A third experiment was similar to the first in that time was varied. The difference was that the scenarios concerned virtuous acts with complications, e.g. a company making a donation to the poor that improves their sales. They actually found the opposite of the effect that Eyal et al. would have predicted: the distant future acts were judged less virtuous. The difference was only 0.32 and not significant.

In a fourth experiment, they varied whether participants were primed by initial questions to be in a high-level or low-level mindset. Here, they found that those primed to be at a low level were more harsh than those at a high level. This was statistically significant and consistent with the predictions of Eyal et al., albeit with around half the magnitude of effect.

Summary: ✗ ✗ ☆ ★ Two clear failures, one success with a smaller effect, and one success with a similar effect. This should be an average score of 1.5/4, but to keep everything integer, I’ve scored it as 2/4 above.

Cleanliness and moral judgments

Schnall et al. (2008) claimed that people make less severe moral judgments when they feel clean.

Johnson et al. replicated these experiments. Participants first completed a puzzle that had either neutral words or cleanliness words, and then responded to a sequence of moral dilemmas. They found no effect at all from being primed with cleanliness words.

In a second experiment, participants watched a clip from Trainspotting with a man using an unclean toilet. They were then asked to wash their hands (or not) and then asked about the same moral dilemmas. The replicators found no effect at all from being assigned to wash hands.

Summary: ✗ Clear failure

Physical warmth and interpersonal warmth

Williams and Bargh (2008) published an article in Science that claimed that people who were physically warm would behave more pro-socially.

Lynott et al. replicated this. Participants were randomly given either a cold pack or a heat pack to evaluate, and then could either choose a gift for a friend or for themselves. Williams and Bargh found that those given heat were around 3.5x as likely to be pro-social. In the replication, they were actually slightly less likely.

Summary: ✗ Clear failure

Moral licensing

Moral licensing is the idea that someone who does something virtuous will later feel justified to do something morally questionable.

Blanken et al. reproduced a set of experiments by Sachdeva et al. (2009). In a first experiment, participants were induced to write a short story about themselves using words that were either positive, neutral, or negative. Afterward, they were asked how much they would be willing to donate to charity. Contra previous work, they found people with positive words were willing to donate slightly more (not significant).

A second experiment was similar except rather than being asked to donate to charity, participants imagined they ran a factory and were asked if they would run a costly filter to reduce pollution. Again, if anything the effect was the opposite of predicted, though it was non-significant.

In a third experiment, they used an online sample with many more subjects, and asked both of the previous questions. For running the filter, they found no effect. For donations, they found that there was no difference between neutral and positive priming, but people who were negatively primed did donate slightly more, and this was statistically significant (p=.044).

Arguably this is one successful replication, but let’s be careful: They basically ran four different experiments (all combinations of donations / running-filters and in-person / online subjects). For each of these they had three different comparisons (positive-vs-neutral / positive-vs-negative / neutral-vs-negative). That’s a lot of opportunities for false discovery, and the one effect that was found is just barely significant.

Summary: ✗ ✗ ✗? Two clear failures and one failure that you could maybe / possibly argue is a success.

Superstition and performance

Damisch et al. (2010) found that manipulating superstitious feelings could have dramatic effects on golfing performance. Subjects told that a ball was lucky were able to make 65% of 100cm putts, as opposed to 48% for controls.

Calin-Jageman and Caldwell reproduced this experiment. They found that the superstition-primed group was only 2% more accurate, which was not significant.

In a second experiment, they tried to make the "lucky" group feel even luckier by having a ball with a shamrock on it and saying "wow! you get to use the lucky ball". Again, there was no impact.

Summary: ✗ Clear failure

Moral behavior and brightness

Banerjee et al. (2012) found that recalling unethical behavior caused people to see the room as darker.

Brandt et al. replicated this. Participants were first asked to describe something they did in the past that was either ethical or unethical. In a first study, they were then asked how bright their room was. In a second study, they were instead asked how desirable lamps, candles, and flashlights were.

They found nothing. Recalling ethical vs. unethical behavior had no effect on the estimated brightness of the room, or how much people wanted light-emitting products.

Summary: ✗ Clear failure

Statistics – the rules of the game

What is statistics about, really? It’s easy to go through a class and get the impression that it’s about manipulating intimidating formulas. But what’s the goal of them? Why did people invent them?

If you zoom out, the big picture is more conceptual than mathematical. Statistics has a crazy, grasping ambition: it wants to tell you how to best use observations to make decisions. For example, you might look at how much it rained each day in the last week, and decide if you should bring an umbrella today. Statistics converts data into ideal actions.

Here, I’ll try to explain this view. I think it’s possible to be quite precise about this while using almost no statistics background and extremely minimal math.

The two important characters that we meet are decision rules and loss functions. Informally, a decision rule is just some procedure that looks at a dataset and makes a choice. A loss function, a basic concept from decision theory, is a precise description of “how bad” a given choice is.

Model Problem: Coinflips

Let’s say you’re confronted with a coin where the odds of heads and tails are not known ahead of time. Still, you are allowed to observe how the coin performs over a number of flips. After that, you’ll need to make a “decision” about the coin. Explicitly:

  • You’ve got a coin, which comes up heads with probability w. You don’t know w.
  • You flip the coin n times.
  • You see k heads and n-k tails.
  • You do something, depending on k. (We’ll come back to this.)

Simple enough, right? Remember, k is the total number of heads after n flips. If you do some math, you can work out a formula for p(k\vert w,n): the probability of seeing exactly k heads. For our purposes, it doesn’t really matter what that formula is, just that it exists. It’s known as a Binomial distribution, and so is sometimes written \mathrm{Binomial}(k\vert n,w).

Here’s an example of what this looks like with n=21 and w=0.5.


Naturally enough, if w=.5, with 21 flips, you tend to see around 10-11 heads. Here’s an example with w=0.2. Here, the most common value is 4, close to 21\times.2=4.2.
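To make the distribution concrete, here is a minimal Python sketch that evaluates p(k\vert w,n) directly from the standard Binomial formula. The function name binomial_pmf is just an illustrative choice, not something defined in this post.

```python
from math import comb

def binomial_pmf(k, n, w):
    """Probability of exactly k heads in n flips when each flip is heads with probability w."""
    return comb(n, k) * w**k * (1 - w)**(n - k)

n = 21
for w in (0.5, 0.2):
    pmf = [binomial_pmf(k, n, w) for k in range(n + 1)]
    best = max(pmf)
    # Collect every k whose probability ties for the maximum.
    modes = [k for k, p in enumerate(pmf) if abs(p - best) < 1e-12]
    print(f"w={w}: most likely k = {modes}, total probability = {sum(pmf):.6f}")
```

For w=0.5 the values k=10 and k=11 tie for most likely, and for w=0.2 the single most likely value is k=4, matching the description above.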


Decisions, decisions

After observing some coin flips, what do we do next? You can imagine facing various possible situations, but we will use the following:

Our situation: After observing n coin flips, you need to guess “heads” or “tails”, for one final coin flip.

Here, you just need to “decide” what the next flip will be. You could face many other decisions, e.g. guessing the true value of w.

Now, suppose that you have a friend who seems very skilled at predicting the final coinflip. What information would you need to reproduce your friend’s skill? All you need to know is if your friend predicts heads or tails for each possible value of k. We think of this as a decision rule, which we write abstractly as

\mathrm{Dec}(k).

This is just a function of one integer k. You can think of this as just a list of what guess to make, for each possible observation, for example:

k \mathrm{Dec}(k)
0 \mathrm{tails}
1 \mathrm{heads}
2 \mathrm{heads}
\vdots \vdots
n \mathrm{tails}

One simple decision rule would be to just predict heads if you saw at least as many heads as tails, i.e. to use

\mathrm{Dec}(k)=\begin{cases} \mathrm{heads}, & k\geq n/2 \\ \mathrm{tails} & k<n/2 \end{cases}.

The goal of statistics is to find the best decision rule, or at least a good one. The rule above is intuitive, but not necessarily the best. And… wait a second… what does it even mean for one decision rule to be “better” than another?

Our goal: minimize the thing that’s designed to be minimized

What happens after you make a prediction? Consider our running example. There are many possibilities, but here are two of the simplest:

  • Loss A: If you predicted wrong, you lose a dollar. If you predicted correctly, nothing happens.
  • Loss B: If you predict “tails” and “heads” comes up, you lose 10 dollars. If you predict “heads” and “tails” comes up, you lose 1 dollar. If you predict correctly, nothing happens.

We abstract these through a concept of a loss function. We write this as

L(w,d).

The first input is the true (unknown) value w, while the second input is the “prediction” d you made. We want the loss to be small.

Now, one point might be confusing. We defined our situation as predicting the next coinflip, but now L is defined comparing d to w, not to a new coinflip. We do this because comparing to w gives the most generality. To deal with our situation, just use the average amount of money you’d lose if the true value of the coin were w. Take loss A. If you predict “tails”, you’ll be wrong with probability w, and so lose w dollars on average, while if you predict “heads”, you’ll be wrong with probability 1-w, and so lose 1-w dollars on average. This leads to the loss

L_{A}(w,d)=\begin{cases} w & d=\mathrm{tails}\\ 1-w & d=\mathrm{heads} \end{cases}.

For loss B, the situation is slightly different, in that you lose 10 times as much in the first case. Thus, the loss is

L_{B}(w,d)=\begin{cases} 10w & d=\mathrm{tails}\\ 1-w & d=\mathrm{heads} \end{cases}.

The definition of a loss function might feel circular– we minimize the loss because we defined the loss as the thing that we want to minimize. What’s going on? Well, a statistical problem has two separate parts: a model of the data generating process, and a loss function describing your goals. Neither of these things is determined by the other.

So, the loss function is part of the problem. Statistics wants to give you what you want. But you need to tell statistics what that is.

Despite the name, a “loss” can be negative– you still just want to minimize it. Machine learning, always optimistic, favors “reward” functions that are to be maximized. Plus ça change.

Model + Loss = Risk

OK! So, we’ve got a model of our data generating process, and we specified some loss function. For a given w, we know the distribution over k, so… I guess… we want to minimize it?

Let’s define the risk to be the average loss that a decision rule gives for a particular value of w. That is,

R(w,\mathrm{Dec})=\sum_{k}p(k\vert w,n)L(w,\mathrm{Dec}(k)).

Here, the second input to R is a decision rule– a precise recipe of what decision to make in each possible situation.

Let’s visualize this. As a set of possible decision rules, I will just consider rules that predict “heads” if they’ve seen at least m heads, and “tails” otherwise:

\mathrm{Dec}_{m}(k)=\begin{cases} \mathrm{heads} & k\geq m\\ \mathrm{tails} & k<m \end{cases}.

With n=21 there are 23 such decision rules, corresponding to m=0 (always predict heads), m=1 (predict heads if you see at least one heads), up to m=22 (always predict tails). These are shown here:

These rules are intuitive: if you’d predict heads after observing 16 heads out of 21, it would be odd to predict tails after seeing 17 instead! It’s true that for losses L_{A} and L_{B}, you don’t lose anything by restricting to this kind of decision rule. However, there are losses for which these decision rules are not enough. (Imagine you lose more when your guess is correct.)
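As a rough sketch of how the risk of these threshold rules could be computed, the Python below evaluates the sum in the definition of R term by term. The helper names (binomial_pmf, loss_a, loss_b, dec_m, risk) are illustrative assumptions, not notation from this post.

```python
from math import comb

def binomial_pmf(k, n, w):
    return comb(n, k) * w**k * (1 - w)**(n - k)

def loss_a(w, d):
    # Loss A: expected dollars lost when the coin's heads probability is w.
    return w if d == "tails" else 1 - w

def loss_b(w, d):
    # Loss B: a wrong "tails" guess costs 10 dollars, a wrong "heads" guess costs 1.
    return 10 * w if d == "tails" else 1 - w

def dec_m(m):
    """Threshold rule Dec_m: predict heads if at least m heads were seen."""
    return lambda k: "heads" if k >= m else "tails"

def risk(w, dec, loss, n=21):
    """R(w, Dec) = sum_k p(k | w, n) * L(w, Dec(k))."""
    return sum(binomial_pmf(k, n, w) * loss(w, dec(k)) for k in range(n + 1))

# Sweep a few thresholds at w = 0.2 for both losses.
for m in (0, 5, 11, 22):
    print(f"m={m}: R_A = {risk(0.2, dec_m(m), loss_a):.4f}, "
          f"R_B = {risk(0.2, dec_m(m), loss_b):.4f}")
```

Two sanity checks follow from the definitions: the always-heads rule (m=0) has risk 1-w under both losses, and the always-tails rule (m=22) has risk w under L_{A} and 10w under L_{B}.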

With those decision rules in place, we can visualize what risk looks like. Here, I fix w=0.2, and I sweep through all the decision rules (by changing m) with loss L_{A}:


The value R_A in the bottom plot is the total area of the green bars in the middle. You can do the same sweep for w=0.4, which is pictured here:


We can visualize the risk in one figure with various w and m. Notice that the curves for w=0.2 and w=0.4 are exactly the same as we saw above.


Of course, we get a different risk depending on what loss function we use. If we repeat the whole process using loss L_{B} we get the following:


Dealing with risk

What’s the point of risk? It tells us how good a decision rule is. We want a decision rule where risk is as low as possible. So you might ask, why not just choose the decision rule that minimizes R(w,\mathrm{Dec})?

The answer is: because we don’t know w! How do we deal with that? Believe it or not, there isn’t a single well-agreed upon “right” thing to do, and so we meet two different schools of thought.

Option 1 : All probability all the time

Bayesian statistics (don’t ask about the name) defines a “prior” distribution p(w) over w. This says which values of w we think are more and less likely. Then, we define the Bayesian risk as the average of R over the prior:

R_{\mathrm{Bayes}}(\mathrm{Dec})=\int_{w=0}^{1}p(w)R(w,\mathrm{Dec})dw.

This just amounts to “averaging” over all the risk curves, weighted by how “probable” we think w is. Here’s the Bayes risk corresponding to L_{A} with a uniform prior p(w)=1:


For reference, the risk curves R(w,\mathrm{Dec}_m) are shown in light grey. Naturally enough, for each value of m, the Bayes risk is just the average of the regular risks for each w.

Here’s the risk corresponding to L_{B}:


That’s all quite natural. But we haven’t really searched through all the decision rules, only the simple ones \mathrm{Dec}_m. For other losses, these simple ones might not be enough, and there are a lot of decision rules. (Even for this toy problem there are 2^{22}, since you can output heads or tails for each of k=0, k=1, …, k=21.)

Fortunately, we can get a formula for the best decision rule for any loss. First, re-write the Bayes risk as

R_{\mathrm{Bayes}}(\mathrm{Dec})=\sum_{k} \left( \int_{w=0}^{1}p(w)p(k\vert n,w)L(w,\mathrm{Dec}(k))dw \right).

This is a sum over k where each term only depends on a single value \mathrm{Dec}(k). So, we just need to make the best decision for each individual value of k separately. This leads to the Bayes-optimal decision rule of

\mathrm{Dec}_{\text{Bayes}}(k)=\arg\min_{d}\int_{w=0}^{1}p(w)p(k\vert w,n)L(w,d)dw.
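The formula above can be sketched in Python as follows. This is only an illustration under stated assumptions: the integral over w is approximated with a midpoint rule (the grid size is arbitrary), the prior defaults to uniform, and all function names are made up for the example.

```python
from math import comb

def binomial_pmf(k, n, w):
    return comb(n, k) * w**k * (1 - w)**(n - k)

def loss_a(w, d):
    return w if d == "tails" else 1 - w

def loss_b(w, d):
    return 10 * w if d == "tails" else 1 - w

def bayes_rule(loss, n=21, prior=lambda w: 1.0, grid=2000):
    """For each k, pick the d minimizing the integral of p(w) p(k|w,n) L(w,d) over w.

    The integral is approximated by a midpoint rule on `grid` cells.
    """
    ws = [(i + 0.5) / grid for i in range(grid)]
    rule = []
    for k in range(n + 1):
        cost = {
            d: sum(prior(w) * binomial_pmf(k, n, w) * loss(w, d) for w in ws) / grid
            for d in ("heads", "tails")
        }
        rule.append(min(cost, key=cost.get))
    return rule

rule_a = bayes_rule(loss_a)
rule_b = bayes_rule(loss_b)
print("L_A: smallest k predicting heads =", rule_a.index("heads"))
print("L_B: smallest k predicting heads =", rule_b.index("heads"))
```

With a uniform prior the integrals also have a closed form: the cost of “tails” is proportional to k+1 under L_{A} (10(k+1) under L_{B}) and the cost of “heads” is proportional to n-k+1. So for n=21 the rule switches to heads at k=11 for L_{A} and at k=2 for L_{B}.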

With a uniform prior p(w)=1, here’s the optimal Bayesian decision rule with loss L_{A}:


And here it is for loss L_B:


Look at that! Just mechanically plugging the loss function into the Bayes-optimal decision rule naturally gives us the behavior we expected– for L_{B}, the rule is very hesitant to predict tails, since the loss is so high if you’re wrong. (Again, these happen to fit in the parameterized family \mathrm{Dec}_{m} defined above, but we didn’t use this assumption in deriving the rules.)

The nice thing about the Bayesian approach is that it’s so systematic. No creativity or cleverness is required. If you specify the data generating process (p(k\vert w,n)), the loss function (L(w,d)), and the prior distribution (p(w)), then the optimal Bayesian decision rule is determined.

There are some disadvantages as well:

  • Firstly, you need to make up the prior, and if you do a terrible job, you’ll get a poor decision rule. If you have little prior knowledge, this can feel incredibly arbitrary. (Imagine you’re trying to estimate Big G.) Different people can have different priors, and then get different results.
  • Actually computing the decision rule requires doing an integral over w, which can be tricky in practice.
  • Even if your prior is good, the decision rule is only optimal when averaged over the prior. Suppose, for every day for the next 10,000 years, a random coin is created with w drawn from p(w). Then, no decision rule will incur less loss than \mathrm{Dec}_{\text{Bayes}}. However, on any particular day, some other decision rule could certainly be better.

So, if you have little idea of your prior, and/or you’re only making a single decision, you might not find much comfort in the Bayesian guarantee.

Some argue that these aren’t really disadvantages. Prediction is impossible without some assumptions, and priors are upfront and explicit. And no method can be optimal for every single day. If you just can’t handle that the risk isn’t optimal for each individual trial, then… maybe go for a walk or something?

Option 2 : Be pessimistic

Frequentist statistics (Why “frequentist”? Don’t think about it!) often takes a different path. Instead of defining a prior over w, let’s take a worst-case view. Let’s define the worst-case risk as

R_{\mathrm{Worst}}(\mathrm{Dec})=\max_{w}R(w,\mathrm{Dec}).

Then, we’d like to choose a decision rule to minimize the worst-case risk. We call this a “minimax” decision rule since we minimize the max (worst-case) risk.

Let’s visualize this with our running example and L_{A}:


As you can see, for each individual decision rule, it searches over the space of parameters w to find the worst case. We can visualize the risk with L_{B} as:


What’s the corresponding minimax decision rule? This is a little tricky to deal with– to see why, let’s expand the worst-case risk a bit more:

R_{\mathrm{Worst}}(\mathrm{Dec})=\max_{w}\sum_{k}p(k\vert n,w)L(w,\mathrm{Dec}(k)).

Unfortunately, we can’t interchange the max and the sum, like we did with the integral and the sum for Bayesian decision rules. This makes it more difficult to write down a closed-form solution. At least in this case, we can still find the best decision rule by searching over our simple rules \mathrm{Dec}_m. But be very mindful that this doesn’t work in general!
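Here is a minimal Python sketch of that search over the simple rules \mathrm{Dec}_m. It assumes the max over w can be approximated on a finite grid (the resolution is arbitrary), and the function names are illustrative only.

```python
from math import comb

def binomial_pmf(k, n, w):
    return comb(n, k) * w**k * (1 - w)**(n - k)

def loss_a(w, d):
    return w if d == "tails" else 1 - w

def loss_b(w, d):
    return 10 * w if d == "tails" else 1 - w

def risk_threshold(w, m, loss, n=21):
    """Risk of the threshold rule Dec_m (heads iff k >= m) at a fixed w."""
    return sum(
        binomial_pmf(k, n, w) * loss(w, "heads" if k >= m else "tails")
        for k in range(n + 1)
    )

def worst_case_risk(m, loss, n=21, grid=400):
    # Approximate the max over w with an evenly spaced grid on [0, 1].
    return max(risk_threshold(i / grid, m, loss, n) for i in range(grid + 1))

def minimax_threshold(loss, n=21):
    # Search only over m = 0, ..., n+1, i.e. the threshold rules Dec_m.
    return min(range(n + 2), key=lambda m: worst_case_risk(m, loss, n))

m_a = minimax_threshold(loss_a)
m_b = minimax_threshold(loss_b)
print("minimax threshold for L_A:", m_a)
print("minimax threshold for L_B:", m_b)
```

This only searches the threshold family, so it finds the best \mathrm{Dec}_m rather than proving optimality over all decision rules.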

For L_{A} we end up with the same decision rule as when minimizing Bayesian risk:


For L_{B}, meanwhile, we get something slightly different:


This is even more conservative than the Bayesian decision rule. \mathrm{Dec}_{B-\mathrm{Bayes}}(1)=\mathrm{tails}, while \mathrm{Dec}_{B-\mathrm{minimax}}(1)=\mathrm{heads}. That is, the Bayesian method predicts heads when it observes 2 or more, while the minimax rule predicts heads if it observes even one. This makes sense intuitively: The minimax decision rule proceeds as if the “worst” w (a small number) is fixed, whereas the Bayesian decision rule less pessimistically averages over all w.

Which decision rule will work better? Well, if w happens to be near the worst-case value, the minimax rule will be better. If you repeat the whole experiment many times with w drawn from the prior, the Bayesian decision rule will be.

If you do the experiment at some w far from the worst-case value, or you repeat the experiment many times with w drawn from a distribution different from your prior, then you have no guarantees.

Neither approach is “better” than the other; they just provide different guarantees. You need to choose what guarantee you want. (You can kind of think of this as a “meta” loss.)

So what about all those formulas, then?

For real problems, the data generating process is usually much more complex than a Binomial. The “decision” is usually more complex than predicting a coinflip– the most common decision is making a guess for the value of w. Even calculating R(w,\mathrm{Dec}) for fixed w and \mathrm{Dec} is often computationally hard, since you need to integrate over all possible observations. In general, finding exact Bayes or minimax optimal decision rules is a huge computational challenge, and at least some degree of approximation is required. That’s the game, that’s why statistics is hard. Still, even for complex situations the rules are the same– you win by finding a decision rule with low risk.

Linear Classifiers and Loss Functions

A linear classifier is probably the simplest machine learning technique. Given some input vector \bf x, predict some output y. One trains a vector of “weights” \bf w that determines the behavior of the classifier. Given some new input \bf x, the prediction f will be:

f({\bf x}) = I[{\bf w}^T {\bf x}>0]

Here, I[\text{expr}] is the indicator function– 1 if \text{expr} is true, and -1 otherwise. So, y is one of two classes.

Often the classifier will be written like I[{\bf w}^T {\bf x}+b>0], i.e. including a “bias” term b. Here I’ll ignore this– if you want a bias term, just append a 1 to every vector \bf x.

Now, it seems like nothing could be simpler than a linear classifier, but there is considerable subtlety in how to train a linear classifier. Given some training data \{({\hat {\bf x}},{\hat y})\}, how do we find the vector of weights \bf w?

There are various ways of doing this. Three popular techniques are:

  • The Perceptron Algorithm
  • Logistic Regression
  • Support Vector Machines (Hinge loss)

The perceptron algorithm works, roughly speaking, as follows: Go through the training data, element by element. If one datum is misclassified, slightly nudge the weights so that it is closer to being correct. This algorithm has the remarkable property that, if the training data is linearly separable, it will find a \bf w that correctly classifies it in a finite number of iterations. (The number of iterations that it takes depends on “how separable” the data is, making fewer mistakes on data with a larger margin.)
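A minimal Python sketch of this procedure, using the classic update \bf w \leftarrow \bf w + y \bf x on each mistake. The max_epochs cap and the toy data are assumptions added for the example, so that the loop also ends on data that is not separable.

```python
def perceptron(data, max_epochs=100):
    """Train weights w so that the sign of w.x matches y in {-1, +1}.

    `data` is a list of (x, y) pairs, with x a list of floats.
    """
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # misclassified, or exactly on the boundary
                # Nudge the weights toward classifying this point correctly.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: done
            break
    return w

# Toy separable data; the trailing 1.0 plays the role of the bias term.
data = [([2.0, 1.0, 1.0], 1), ([1.0, 3.0, 1.0], 1),
        ([-1.0, -2.0, 1.0], -1), ([-2.0, -0.5, 1.0], -1)]
w = perceptron(data)
print("weights:", w)
```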

Unfortunately, the assumption that the training data is linearly separable is basically never true in practice. There is still a bound in terms of how far points would need to be moved in order to be linearly separable, but it isn’t as strong as we might like. (I don’t know what the folklore is for how this tends to work in practice.)

Logistic regression and support vector machines take a different approach. There, a “loss function” is defined in terms of the weights, and one just optimizes the loss to find the best weights. (I’m ignoring regularization here for simplicity.) For logistic regression, the loss is

L = -\sum_{({\hat {\bf x}},{\hat y})} \log p({\hat y}|{\hat {\bf x}})

where p(y=1|{\bf x})=\sigma({\bf w}^T {\bf x}), for a sigmoid function \sigma, and p(y=-1|{\bf x})=1-p(y=1|{\bf x}).

For a support vector machine, the loss is

L = \sum_{({\hat {\bf x}},{\hat y})} (1-{\hat y} \cdot {\bf w}^T {\hat {\bf x}})_+

where (a)_+ is a if a>0 and 0 otherwise. This is a hinge loss. Notice that it will be zero if {\hat y} \cdot {\bf w}^T {\hat {\bf x}} > 1, or if that particular training element is comfortably on the correct side of the decision boundary. Otherwise, the “pain” is proportional to “how wrong” the classification is.

Both the hinge loss and the logistic regression loss are convex. This means that if we apply a nonlinear search procedure to find a local minimum, that will also be a global minimum.
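To compare the two losses with plain classification error, here is a small Python sketch. It assumes \sigma is the logistic sigmoid, in which case -\log p(y|{\bf x}) = \log(1+\exp(-y \cdot {\bf w}^T {\bf x})) for y in \{-1,+1\}. The function names and the toy data are illustrative.

```python
from math import exp, log

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def logistic_loss(w, data):
    """-sum log p(y | x), with p(y=1 | x) = sigmoid(w.x) and y in {-1, +1}."""
    return sum(log(1 + exp(-y * dot(w, x))) for x, y in data)

def hinge_loss(w, data):
    """Sum of (1 - y * w.x)_+ over the training set."""
    return sum(max(0.0, 1 - y * dot(w, x)) for x, y in data)

def zero_one_loss(w, data):
    """Number of training points that I[w.x > 0] gets wrong."""
    return sum(1 for x, y in data if (1 if dot(w, x) > 0 else -1) != y)

data = [([1.0, 1.0], 1), ([-2.0, 1.0], -1), ([0.5, 1.0], -1)]
w = [1.0, 0.0]
print("logistic:", round(logistic_loss(w, data), 4))
print("hinge:   ", hinge_loss(w, data))
print("0/1:     ", zero_one_loss(w, data))
```

Only the third point is misclassified here, yet the first point still contributes to the logistic loss, while its hinge loss is exactly zero once it reaches the margin.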

These different loss functions are compared here.


Now, the critical point about these three methods is that none of them do the seemingly obvious thing: find the vector of weights {\bf w} that has the lowest classification error on the training data. Why not? Some would defend logistic regression or SVM for theoretical reasons (namely a meaningful probabilistic interpretation, and the reasonableness and theoretical guarantees of max-margin methods, respectively).

However, probably the more significant hurdle is computational considerations. Namely, the problem of finding the weights with lowest classification error is (in order of increasing horribleness) non-convex, non-differentiable, and NP-hard.

In fact it is NP-hard even to find an approximate solution, with worst-case guarantees. (See citation 2 below, which, interestingly, gives an approximate algorithm in terms of a property of the data.)

Nevertheless, if classification error is what we want, I don’t see how it makes sense to minimize some alternative loss function. As such, I decided today to try the following algorithm.

  1. Apply an SVM to get an initial solution.
  2. Apply heuristic search to minimize classification error, initialized to the solution of step 1.

Now, I am well aware of the problems with non-convex optimization. However, the mere fact that logistic regression or hinge loss is convex is not an argument in their favor. If theoretical considerations dictate that we minimize classification error, just substituting a different loss, and then refusing to look at the classification rate is highly questionable. Sure, that can lead to a convex optimization problem, but that’s because a different problem is being solved! The goal is accurate prediction of future data, not accurate minimization of a loss on the training data. If we use our best convex approximation to initialize the heuristic optimization of the loss we really want, we will never do worse.


There are some heuristics available for doing this (Reference 3), but they seem a little expensive. I decided to try something really simple.

  1. Fit a classifier (by hinge loss or logistic regression).
  2. Throw out a few of the most badly misclassified points.
  3. Repeat until all remaining points are correctly classified.

The idea is that, by “giving up” on badly misclassified points, the boundary might be movable into a position where it correctly classifies other points. There is no guarantee in general that this will find a linear classifier with a lower misclassification rate, but it should not do worse.
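The three steps above could be sketched roughly as follows. This is not the code behind the results reported here: the fitting routine is a crude subgradient descent on the hinge loss, the step size, iteration counts, and toy data are arbitrary, and keeping the best classifier as scored on the full training set is an added assumption that makes “should not do worse” hold by construction.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def fit_hinge(data, steps=200, lr=0.1):
    """Crude subgradient descent on the (unregularized) hinge loss."""
    w = [0.0] * len(data[0][0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            if y * dot(w, x) < 1:  # inside the margin or misclassified
                grad = [g - y * xi for g, xi in zip(grad, x)]
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w

def errors(w, data):
    return sum(1 for x, y in data if y * dot(w, x) <= 0)

def fit_with_discards(data, drop=1, max_rounds=50):
    """Refit repeatedly, each round dropping the `drop` worst-misclassified points."""
    kept = list(data)
    best_w, best_err = None, None
    for _ in range(max_rounds):
        w = fit_hinge(kept)
        err = errors(w, data)  # always score on the full training set
        if best_err is None or err < best_err:
            best_w, best_err = w, err
        # Misclassified points among those still kept, worst first.
        wrong = sorted((p for p in kept if p[1] * dot(w, p[0]) <= 0),
                       key=lambda p: p[1] * dot(w, p[0]))
        if not wrong or len(kept) <= drop:
            break
        for p in wrong[:drop]:
            kept.remove(p)
    return best_w, best_err

# Toy 1-D data with a bias feature; the last point is a deliberate outlier.
data = [([1.0, 1.0], 1), ([2.0, 1.0], 1), ([3.0, 1.0], 1),
        ([-1.0, 1.0], -1), ([-2.0, 1.0], -1), ([-3.0, 1.0], -1),
        ([-10.0, 1.0], 1)]
w, err = fit_with_discards(data)
print("training errors before:", errors(fit_hinge(data), data), "after:", err)
```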

To test this out, I got some data from the MNIST handwritten digit database. To make the problem harder, I took one class to be random images from either the set of 1’s or 2’s, while the other class was 3’s or 4’s. The images were downsampled to 7×7, and a constant of one added. So we are looking for a linear classifier in a 50 dimensional space.

The classifier was trained on a database of 10,000 examples, with a test set of 10,000 examples.

Here are the results where we throw out the worst 100 training points in step 2:

Here are the results when we throw out 20 at a time:

Here are the results when we throw out 5 at a time:

There seems to be a trade-off in step 2: Fewer models need to be fit if we throw out more points at a time. However, this seems to come with a small penalty in terms of the classification error on the final model.

So, in summary– a drop in classification error on test data from .0941 to .078. That’s a 17% drop. (Or a 21% drop, depending upon which rate you use as a base.) This from a method that you can implement with basically zero extra work if you already have a linear classifier. Seems worth a try.


[1] The Apparent Tradeoff between Computational Complexity and Generalization of Learning: A Biased Survey of our Current Knowledge by Shai Ben-David

[2] Efficient Learning of Linear Perceptrons by Shai Ben-David and Hans Ulrich Simon

[3] Optimizing 0/1 Loss for Perceptrons by Random Coordinate Descent by L. Li and H.-T. Lin

Marginal Beliefs of Random MRFs

A pairwise Markov Random Field is a way of defining a probability distribution over some vector {\bf x}. One way to write one is

p({\bf x}) \propto \exp( \sum_i \phi(x_i) + \sum_{(i,j)} \psi(x_i,x_j) ).

Here the first sum is over all the variables, and the second sum is over neighboring pairs. I generated some random distributions over binary valued variables. For each i, I set \phi(x_i=0)=0, and \phi(x_i=1)=r_i where r_i is some value randomly chosen from a standard Gaussian. For the pairwise terms, I used \psi(x_i,x_j) = .75 \cdot I(x_i=x_j). (i.e. \psi(x_i,x_j) is .75 when the arguments are the same, and zero otherwise.) This is an “attractive network”, where neighboring variables want to have the same value.

Computing marginals p(x_i) is hard in graphs that are not treelike. Here, I approximate them using a nonlinear minimization of a “free energy” similar to that used in loopy belief propagation.
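For a model small enough to enumerate, the exact marginals can be computed by brute force, which is handy as a reference when checking an approximation. The Python below is a sketch of that, not the free-energy method used for the figures. The 3×3 grid, the 4-connected neighborhood, and the function name are assumptions.

```python
import itertools
import math
import random

def exact_marginals(r, edges, coupling=0.75):
    """p(x_i = 1) by brute-force enumeration for a binary pairwise MRF.

    phi(x_i = 1) = r[i], phi(x_i = 0) = 0, and psi(x_i, x_j) = coupling * I(x_i == x_j).
    The cost is 2**len(r), so this only works for tiny models.
    """
    n = len(r)
    z = 0.0
    on = [0.0] * n
    for x in itertools.product((0, 1), repeat=n):
        score = sum(r[i] for i in range(n) if x[i] == 1)
        score += sum(coupling for i, j in edges if x[i] == x[j])
        weight = math.exp(score)
        z += weight
        for i in range(n):
            if x[i] == 1:
                on[i] += weight
    return [v / z for v in on]

random.seed(0)
side = 3
r = [random.gauss(0, 1) for _ in range(side * side)]
# Horizontal then vertical neighbor pairs of a 4-connected grid.
edges = [(i * side + j, i * side + j + 1) for i in range(side) for j in range(side - 1)]
edges += [(i * side + j, (i + 1) * side + j) for i in range(side - 1) for j in range(side)]
print("attractive:", [round(p, 3) for p in exact_marginals(r, edges)])
print("repellent: ", [round(p, 3) for p in exact_marginals(r, edges, coupling=-0.75)])
```

A negative coupling penalizes equal neighbors, so the same function also covers a network in which neighboring variables prefer different values.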

Here, I show the random single-variate biases r_i and the resulting beliefs. What we see is constant valued regions (encouraged by \psi) interrupted where the \phi is very strong.

Now, with more variables.

Now, a “repellent” network. I repeated the procedure above, but changed the pairwise interactions to \psi(x_i,x_j) = .75 \cdot I(x_i\not=x_j). Neighboring variables want to have different values. Notice this is the opposite of the above behavior– regions of “checkerboard” interrupted where the \phi outvotes \psi.

Now, the repellent network with more variables.