The replication crisis in psychology started with a bang with Daryl Bem’s 2011 paper Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect. In this paper, Bem performed experiments that appeared to prove the existence of extrasensory perception (ESP). For example, subjects could guess (better than chance) where erotic images would appear on a screen. There were thousands of subjects. The paper used simple, standard statistical methods, and found significance for eight of nine experiments. Bem is a respected researcher, whom no one suspects of anything underhanded.
The problem, of course, is that ESP isn’t real.
The psychology community was shaken. A smart person had carefully followed standard research practices and proved something impossible. This was ominous. Bem’s paper was only questioned because the result was impossible. How many other false results were lurking undetected in the literature?
Today, we’d say he had wandered the garden of forking paths: If you look at enough datasets from enough angles, you’ll always find significance eventually. It’s easy to do this even if you have the best intentions.
Many psychologists saw the Bem affair as a crisis and quickly moved to assess the damage. In 2014, an entire issue of Social Psychology was dedicated to registered replications. Each paper was both a replication (a repeat of a previous experiment) and pre-registered (the research plan and statistical analysis were peer-reviewed before the experiment was done).
So how well did existing results replicate? In their introduction to the issue, Nosek and Lakens are rather coy:
This special issue contains several replications of textbook studies, sometimes with surprising results.
Awesome, but which ones were surprising, how surprising were they, and in what way?
I wanted a number, a summary, something I could use as a heuristic. So, I read all the papers and rated every experiment in every paper as either a full success (★), a success with a much smaller effect size (☆), or a failure (✗). I used the summary from the authors if they gave one, which they often didn’t, probably because people are squeamish about yelling "my famous colleague’s Science paper is bogus" too loudly. Then, for each effect, I averaged the scores for all the experiments using ✗=0, ☆=2, and ★=4.
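To make the scoring rule concrete, here is a minimal sketch of the averaging step. The function name `effect_score` and the example ratings are illustrative assumptions, not something taken from the papers.

```python
from statistics import mean

# Point values for the three ratings described above.
POINTS = {"✗": 0, "☆": 2, "★": 4}


def effect_score(ratings):
    """Average the per-experiment ratings for one effect (0 to 4)."""
    return mean(POINTS[r] for r in ratings)


# Hypothetical effect tested in four experiments:
# two failures, one weaker success, one full success.
example = ["✗", "✗", "☆", "★"]
print(effect_score(example))  # 1.5
```

An average like 1.5 is not an integer, so it has to be rounded before it can be compared with an integer guess.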
So, let’s play a game.
I’ll give a short description of the effects that were tested. You can predict how well each replicated by choosing an integer between 0 (failure) and 4 (full success). You can score yourself manually, or there’s a form you can use below.
(Or, if you’re the impatient type, feel free to jump to the summary.)
The first few effects are textbook results. (📖)
(1) The primacy of warmth is the idea that when forming impressions of others, the "warmth" of someone’s personality is more important than competence. This claim goes back to Solomon Asch in 1946 where people were given lists of various attributes for people like "intelligent, skillful, industrious, warm, determined, practical, cautious" and asked which was most important.
(2) Deviation-rejection is the idea that someone who stubbornly holds an opinion contrary to a group consensus will ultimately be rejected from that group. Schachter (1951) ran focus groups to discuss what to do with a juvenile delinquent. Most people were lenient, but a confederate in the group always supported a harsh punishment. This confederate was talked to less over time, and chosen less for future groups.
(3) The Romeo and Juliet effect is the idea that parental opposition to a relationship will increase the love and commitment in that relationship. This was first suggested by Driscoll et al. (1972), who found that parental interference was correlated with love and trust in relationships and that, over time, increases in interference were correlated with an increase in love.
(4) Single exposure musical conditioning was investigated by Gorn (1982) by randomly showing people either a blue or beige pen, while playing either music they liked or disliked. Later, people could choose a pen of either color to keep. This experiment found that 79% of people who heard music they liked chose the pen shown on screen, while only 30% of people who heard music they disliked did the same.
The next few effects concerned papers where the literature is somewhat in conflict or where there are moderators with unclear impact. (⚔️)
(5) Stereotype priming was investigated by Blair and Banaji (1996) who found that briefly flashing a stereotypically "male" or "female" word in front of someone changed how long it took to discriminate between male or female first names. Additionally, Banaji and Hardin (1996) found an effect even if the discrimination was unrelated to the primes: "Male" words made participants faster to recognize male names even if the task was to discriminate city names vs. first names.
(6) Stereotype susceptibility (or stereotype threat) is the idea that cognitive performance is influenced by stereotypes. Shih et al. (1999) took Asian-American women and had them take a math test, while being primed to either think about being women or about being Asian. The Asian-primed group got 54% of the questions right while the female-primed group got only 43% right. However, these results came from a small sample, and results in real-world applications have been mixed.
(7) Sex differences in distress from infidelity. A common idea in evolutionary psychology is that males are more upset by sexual infidelity than females. Shackelford et al. (2004) looked at a population with a mean age of 20 years and asked whether people would be more distressed if their partner had passionate sex with someone else or formed a deep emotional attachment. 76% of young men chose sex, as opposed to 32% of women. In an older population with a mean age of 67 years, 68% of men chose sex, as opposed to 49% of women.
(8) Psychological distance and moral judgment is the idea that people make more intense moral judgments (positive or negative) for more psychologically distant acts. Eyal et al. (2008) had participants make judgments about people who committed various acts (incest, eating a dead pet, donating to charity), varying whether it happened in the near or the distant future. People judged those in the distant future around 1.5 points more harshly on a -5 to +5 scale. However, Gong and Medin (2012) performed a similar experiment and came to the opposite conclusion.
Finally, there were recent studies that had attracted a lot of attention. (🔥)
(9) Does cleanliness influence moral judgments? Schnall et al. (2008) found that people make less severe moral judgments when they feel clean. They primed people with words and then asked people to judge a set of moral dilemmas, e.g. trolley problems or faking information on a resume. They found that people primed with "clean" words judged people around 1 point less harshly than those primed with neutral words, on a 7-point scale. In a second experiment, people saw a clip from Trainspotting with a man using an unclean toilet and were asked to wash their hands (or not) before being given the same moral dilemmas. Those who washed their hands judged people around 0.7 points less harshly.
(10) Does physical warmth promote interpersonal warmth? Williams and Bargh (2008) gave participants either a cold pack or a heat pack to evaluate, after which they could choose a gift either for themselves or for a friend. Those given the heat pack were around 3.5x more likely to choose a gift for a friend.
(11) Moral licensing is the idea that someone who does something virtuous will later feel justified to do something morally questionable. Sachdeva et al. (2009) had participants write a short story about themselves using words that were either positive words ("caring", "generous") or negative words ("disloyal", "greedy"). Afterward, they asked how much they would be willing to donate to charity. Those who used the positive words donated a mean of $1.07, while those who used negative words donated $5.30.
(12) Can superstition improve performance? Damisch et al. (2010) gave subjects a golf ball and asked them to attempt a 100cm putt. Subjects told the ball had been lucky so far were able to make 65% of putts, as opposed to 48% of controls.
(13) Is moral behavior connected to brightness? Banerjee et al. (2012) asked participants to describe something they did in the past that was either ethical or unethical, then asked them how bright their room was. Those in the ethical group described their room as around 0.6 points brighter on a 7-point scale. They also judged various light-emitting products (lamps, candles, flashlights) as around 2 points more desirable on a 7-point scale.
To play the game, fill out this Google form quiz thing.
Or, you can score yourself. In the following table, record your absolute error for each effect. (If you guessed 1 and the correct answer was 4, you get 3 points.) Then add all these errors up.
The best possible score is 0, while the worst is 41.
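If you would rather not add things up by hand, the manual scoring rule can be sketched as below. The function name `game_score` and the three-effect example are hypothetical, chosen so that nothing here spoils the real answers.

```python
def game_score(guesses, answers):
    """Sum of absolute errors between guesses and true scores (lower is better)."""
    if len(guesses) != len(answers):
        raise ValueError("need exactly one guess per effect")
    for value in list(guesses) + list(answers):
        if value not in (0, 1, 2, 3, 4):
            raise ValueError("scores must be integers from 0 to 4")
    return sum(abs(g - a) for g, a in zip(guesses, answers))


# Hypothetical three-effect example (not the real answers).
guesses = [1, 0, 3]
answers = [4, 0, 2]
print(game_score(guesses, answers))  # 3 + 0 + 1 = 4
```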
Here are the results for each effect, ranging from ○○○○ (failure) to ●●●● (full replication).
|Primacy of warmth||📖||○○○○|
|Romeo and Juliet effect||📖||○○○○|
|Single exposure musical conditioning||📖||●●●●|
|Sex differences in distress from infidelity||⚔️||●○○○|
|Psychological distance and moral judgment||⚔️||●●○○|
|Cleanliness and moral judgments||🔥||○○○○|
|Physical warmth and interpersonal warmth||🔥||○○○○|
|Superstition and performance||🔥||○○○○|
|Moral behavior and brightness||🔥||○○○○|
What to make of this?
Obviously, the replication rate is strikingly low. Only one effect replicated at the original effect size, and even this has some caveats (below). The average replication score was 0.8 / 4. And one effect (Romeo and Juliet) deserves a negative score: It didn’t just fail to replicate, there was a large effect in the opposite direction.
Also striking is that all of the new hot effects (🔥) failed—not one experiment in one paper replicated even at a smaller effect size. Strangely, all the conflicted papers (⚔️) replicated at least a little bit.
I don’t think things are nearly as bad as this table suggests. For one thing, if you were a psychologist, what would make you perform a registered replication? Personally, I’d be motivated by something like this:
I’d do it because I thought some famous effect wasn’t real. So this is a highly non-random sample. The correct conclusion is less "psychology is all wrong" and more "when psychologists think an effect is bogus, they’re usually right".
Second, these results are mostly social psychology. You, dear reader, are probably not a social psychologist, but rather someone interested in the broad art of predicting stuff, which is the domain of cognitive psychology. (At least, unless you’re predicting how people interact in groups.) The Open Science Collaboration in 2015 tried to reproduce a bunch of recent papers in social and cognitive psychology. They found that around 28% of social effects subjectively "replicated", as opposed to around 53% of cognitive effects. They also found that replication effect sizes were much closer to the original effect sizes for cognitive psychology (a ratio of around 0.56 as compared to 0.31 in social psychology).
Here are more details about what happened in each of the replications, and justifications of the ratings I gave. If you’re the author of one of these replications and I got anything wrong, I’d love to hear from you.
The primacy of warmth
This claim goes back to Solomon Asch in 1946. The idea is that, when forming impressions of people, warmth-related judgements are more important than competence. Nauts et al. replicated Asch’s experiments. They showed different people various lists of traits, such as the following:
- (Condition 1) Intelligent, skillful, industrious, warm, determined, practical, cautious
- (Condition 2) Intelligent, skillful, industrious, cold, determined, practical, cautious
- (Condition 3) Obedient, weak, shallow, warm, unambitious, vain
- (Condition 4) Vain, shrewd, unscrupulous, warm, shallow, envious
- (Condition 5) Intelligent, skillful, sincere, cold, conscientious, helpful, modest
The fraction of people that chose "warm" or "cold" as the most important trait was as follows:
|Condition||most chosen trait||choosing warm/cold|
According to Asch’s theory, people should choose "warm" and "cold" as the most important traits in conditions 1 and 2, but not in the others. It is true that warm/cold were considered more important in these conditions, but they were never the most common choice as the most important trait. Nauts et al. call this a clear failure of this particular experiment to replicate, though they emphasize that other research makes clear that warmth is indeed primary in many circumstances.
Summary: ✗ Total replication failure.
Deviation-rejection
Wesselmann et al. investigated "deviation-rejection", or the claim that someone who holds an opinion contrary to a group consensus will ultimately be rejected from that group. Following Schachter (1951), they created groups of 10 people, consisting of 7 experimental subjects and 3 confederates. Everyone was given a case study of a juvenile delinquent named Johnny Rocco and asked how Johnny should be treated, followed by a discussion. Most subjects were lenient. The mode confederate followed the current group consensus. The slider confederate first supported harsh treatment, then gradually shifted towards leniency. The deviate confederate always said to punish Johnny and never changed.
They looked at three claims made by Schachter. First, did people communicate with the deviate confederate less over time? They did seem to, though the data was noisy.
Second, they looked at whether people would assign the confederates to leadership roles in future committees. Contra Schachter, they found no effect.
Third, they had people rank each person for how much they’d like to have them in future groups. On a scale of 1-3, the slider got a score of 1.74, the mode a score of 1.91, and the deviate a score of 2.34. (So people dislike nonconformists, but like people who are willing to change their mind?) This replicates Schachter again, albeit with a significantly smaller effect size.
Summary: ☆ Replication mostly succeeded, but with a smaller effect size
The Romeo and Juliet effect
The Romeo and Juliet effect is the claim that parental opposition to a relationship will increase the love and commitment in that relationship. This was first suggested by Driscoll et al. (1972), who found that parental interference was correlated with love and trust, and also that over time, increases in parental interference were correlated with increases in love.
Sinclair et al. replicated these experiments. They found people online who were in relationships and asked them about their relationship, e.g. how much they loved their partner and how much they trusted their partner. They also asked how much friends or parents disapproved of their relationship. They contacted these people again around 4 months later. They then looked at correlations.
Their results were the opposite of what Driscoll et al. found. Greater approval from friends and parents was associated with higher relationship quality, not lower. And increased parental disapproval was correlated with decreased love.
Summary: ✗ Calling this just a failure is generous. The original effect not only failed to replicate, but the opposite effect held (with a large effect size and high statistical significance).
Single-exposure musical conditioning
Gorn (1982) randomly showed people either a blue or beige pen, while playing either music they liked or disliked. Later, they could choose one pen of either color to keep. This experiment found that 79% of people who heard music they liked chose the pen shown on screen, compared with only 30% of people who heard music they disliked.
Vermeulen et al. reproduced this result. In a first experiment, they used music similar to the original music from 1982: Summer Nights from Grease as the "liked" music, and Aaj Kaun Gali Gayo Shyam by Parveen Sultana as the "disliked" music. They confirmed that people really did like the Grease music more than the other song (they had mean ratings of 3.72 vs 2.11 on a scale of 1-7).
After this, they repeated the same study, but with an actor pretending to be a researcher from another department and a post-experimental questionnaire. This again found no real effect.
Finally, maybe the problem was just using old music that college students are unfamiliar with? To test this, they got two different renditions of Rihanna’s We Found Love by cover artists, and selected one (that people liked) as the "liked" music and one (that people didn’t like) as the "disliked" music. People liked the good rendition much more than the other (mean score of 5.60 vs. 2.48).
For this third experiment, they ran out of students in the class, and so had fewer subjects than planned. Still, they found that 57% of people who heard the "liked" music chose the pen on screen, as opposed to only 23% of people who heard the "disliked" music. Despite the smaller sample, this was still highly significant, with p=.003.
It’s unfortunate that they ran out of subjects for the third experiment. Still, I think this mostly rescues the effect: Students didn’t really like the music from Grease in the first two experiments, they just disliked it slightly less (3.72 is a low score on a scale of 1-7!). The third experiment is the only one where there’s a big difference in how much people like the music, and there’s a big effect there.
Summary: ★? The authors call this a "somewhat unreliable successful replication".
Stereotype priming
Blair and Banaji (1996) claimed that if you briefly flashed a stereotypically "male" or "female" word in front of someone, that would change how long it would take people to discriminate between male or female first names. Additionally, Banaji and Hardin (1996) found this could have an effect even if the discrimination was unrelated to the primes (e.g. discriminating cities vs. names).
Müller and Rothermund set out to replicate these effects. They had people come into the lab and fixate on a screen. Then, they’d briefly be shown either stereotypically "male" or "female" priming words. Example male primes are "computer", "to fight", and "aggressive", while example female primes are "ballet", "to put on make-up", and "gossipy". These primes were shown for only 200 ms.
In a first experiment, the prime was followed by either a male name ("Achim", this happened in Germany) or a female name ("Annette"), which subjects needed to classify as quickly as possible. Here are the mean times (and standard deviations) in ms:
|Target Gender||Male Prime||Female Prime|
|male||554 (80)||566 (80)|
|female||562 (83)||549 (80)|
There was a significant effect—albeit a small one. However, Blair and Banaji also found a small effect, although around 2x or 3x larger than this one.
A second experiment was the same, except now they would see a (male or female) first name 50% of the time and a city name (e.g. "Aachen") the other 50% of the time. Now, subjects needed to distinguish a first name from a city name.
|Target Gender||Male Prime||Female Prime|
|male||605 (91)||605 (90)|
|female||570 (86)||567 (85)|
In this analysis, they simply ignore all trials where a city was shown, so this table is showing how long it takes to recognize male/female names as names. For whatever reason, people found it to be harder to recognize male names, but priming had no effect on this. In contrast, Banaji and Hardin had found that changing the prime would have an effect of around 14 ms.
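One rough way to read the two tables above is to compare trials where the prime matched the name's gender (congruent) against trials where it didn't (incongruent). The sketch below does that with the cell means only. It weights the four cells equally and ignores the standard deviations, so it is a back-of-the-envelope summary under those assumptions, not the analysis Müller and Rothermund ran. The function name `priming_effect` is made up for illustration.

```python
def priming_effect(rt):
    """Mean incongruent RT minus mean congruent RT, in ms.

    `rt` maps (target gender, prime gender) to a mean response time.
    Positive values mean matching primes sped people up.
    """
    congruent = (rt[("male", "male")] + rt[("female", "female")]) / 2
    incongruent = (rt[("male", "female")] + rt[("female", "male")]) / 2
    return incongruent - congruent


# Cell means copied from the two tables above.
exp1 = {("male", "male"): 554, ("male", "female"): 566,
        ("female", "male"): 562, ("female", "female"): 549}
exp2 = {("male", "male"): 605, ("male", "female"): 605,
        ("female", "male"): 570, ("female", "female"): 567}

print(priming_effect(exp1))  # 12.5
print(priming_effect(exp2))  # 1.5
```

By this crude measure the first experiment shows a gap of about 12 ms, while the second shows almost none.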
Summary: ☆ / ✗ Half the replication failed, the other half succeeded with a smaller effect size.
Stereotype susceptibility
Shih et al. (1999) took Asian-American women and had them take a math test. Before the math test, some were primed to think about being women by being given questions about coed or single-sex living arrangements. Others were primed to think about being Asian by answering questions about family and ethnicity. They found that the Asian-primed group got 54% right while the female-primed group got 43% right. (The control group got 49%.)
Gibson et al. replicated this at six universities in the Southeastern US. With a sample of 156 subjects (as opposed to only 16) they found that the Asian-primed group got 59% right, while the female-primed group got 53% right. This difference was smaller and nonsignificant (p=.08). They then excluded participants who weren’t aware of stereotypes regarding math and Asians/women. Among the remaining 127 subjects, the Asian-primed group got 63% right, while the female-primed group got 51% right, and the effect was significant (p=.02).
But then, in a second article, Moon and Roeder tried to replicate exactly the same result using the same experimental protocol. They found that the Asian-primed group got 46% correct, while the female-primed group got 43% correct. This difference was nonsignificant (p=.44). However, in this same experiment, the control group got 50% correct.
Among only those aware of the stereotype, the Asian-primed group got 47% correct, while the female-primed group got 43%. Both of these results were nonsignificant (p=.44, and p=.28, respectively). Here again, the control group got 50%. The higher performance in the control group is inconsistent with the theory of priming, so this is a conclusive failure.
Summary: ☆? / ✗ The first replication basically half-succeeded, while the second failed.
Sex differences in distress from infidelity
A common idea in evolutionary psychology is that males are more upset by sexual infidelity than females. Shackelford et al. (2004) looked at datasets from two populations, one with a mean age of 20 years and one with a mean age of 67. They found that in both populations males were more distressed by infidelity than females, though the difference was smaller in the older population.
Hans IJzerman et al. replicated these experiments. In the younger population, they successfully replicated the result. In the older population, they did not replicate the result.
Summary: ☆ / ✗ One successful replication with a smaller effect, and one failed replication.
Psychological distance and moral judgment
Žeželj and Jokić replicated this experiment. They had subjects make judgments about the actions of people in hypothetical scenarios. In a first experiment, they described incest or eating a dead pet, but varied whether the acts happened now or in the distant future. Contra Eyal et al., distant-future transgressions were judged similarly to near-future transgressions. (Near future was judged 0.12 points more harshly on a -5 to +5 scale, so the effect was actually in the wrong direction.)
In a second experiment, they instead varied whether subjects were asked to think in the first person about a specific person they knew performing the act, or to focus on their thoughts and think about it from a third-person perspective. All scenarios showed that people were more harsh when thinking about things from a distance. The difference was around 0.44 averaged over scenarios. This magnitude was significant and similar to what Eyal et al.’s research predicted.
A third experiment was similar to the first in that time was varied. The difference was that the scenarios concerned virtuous acts with complications, e.g. a company making a donation to the poor that improves their sales. They actually found the opposite of what Eyal et al. would have predicted: the distant future acts were judged less virtuous. The difference was only 0.32 and not significant.
In a fourth experiment, participants were primed by initial questions to be in either a high-level or a low-level mindset. Here, they found that those primed to be at a low level were more harsh than those at a high level. This was statistically significant and consistent with the predictions of Eyal et al., albeit at around half the magnitude.
Summary: ✗ ✗ ☆ ★ Two clear failures, one success with a smaller effect, and one success with a similar effect. This should be an average score of 1.5/4, but to keep everything integer, I’ve scored it as 2/4 above.
Cleanliness and moral judgments
Schnall et al. (2008) claimed that people make less severe moral judgments when they feel clean.
Johnson et al. replicated these experiments. Participants first completed a puzzle that had either neutral words or cleanliness words, and then responded to a sequence of moral dilemmas. They found no effect at all from being primed with cleanliness words.
In a second experiment, they watched a clip from Trainspotting with a man using an unclean toilet. They were then asked to wash their hands (or not) and then asked about the same moral dilemmas. They found no effect at all from being assigned to wash your hands.
Summary: ✗ Clear failure
Physical warmth and interpersonal warmth
Williams and Bargh (2008) published an article in Science that claimed that people who were physically warm would behave more pro-socially.
Lynott et al. replicated this. Participants were randomly given either a cold pack or a heat pack to evaluate, and then could either choose a gift for a friend or for themselves. Williams and Bargh found that those given heat were around 3.5x as likely to be pro-social. In the replication, they were actually slightly less likely.
Summary: ✗ Clear failure
Moral licensing
Moral licensing is the idea that someone who does something virtuous will later feel justified to do something morally questionable.
Blanken et al. reproduced a set of experiments by Sachdeva et al. (2009). In a first experiment, participants were induced to write a short story about themselves using words that were either positive, neutral, or negative. Afterward, they were asked how much they would be willing to donate to charity. Contra previous work, people with positive words were willing to donate slightly more (not significant).
A second experiment was similar except rather than being asked to donate to charity, participants imagined they ran a factory and were asked if they would run a costly filter to reduce pollution. Again, if anything the effect was the opposite of predicted, though it was non-significant.
In a third experiment, they used an online sample with many more subjects, and asked both of the previous questions. For running the filter, they found no effect. For donations, they found that there was no difference between neutral and positive priming, but people who were negatively primed did donate slightly more, and this was statistically significant (p=.044).
Arguably this is one successful replication, but let’s be careful: They basically ran four different experiments (all combinations of donations / running-filters and in-person / online subjects). For each of these they had three different comparisons (positive-vs-neutral / positive-vs-negative / neutral-vs-negative). That’s a lot of opportunities for false discovery, and the one effect that was found is just barely significant.
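To get a feel for how many "opportunities" that is, here is a rough calculation of the chance of at least one false positive across 4 x 3 = 12 comparisons. It assumes a 0.05 threshold and that the tests are independent. Neither assumption is stated in the replication, and the independence one is surely wrong here since the comparisons share subjects, so treat the number as a ballpark only. The function names are made up for illustration.

```python
def familywise_error(n_tests, alpha=0.05):
    """P(at least one false positive) across n independent tests with no real effects."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


n_tests = 4 * 3  # four experiments, three pairwise comparisons each
print(round(familywise_error(n_tests), 2))      # 0.46
print(round(bonferroni_threshold(n_tests), 4))  # 0.0042
```

Under these assumptions there is close to a coin flip's chance of at least one "significant" result even if nothing is going on, and p=.044 is well above the Bonferroni-style threshold.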
Summary: ✗ ✗ ✗? Two clear failures and one failure that you could maybe / possibly argue is a success.
Superstition and performance
Damisch et al. (2010) found that manipulating superstitious feelings could have dramatic effects on golfing performance. Subjects told that a ball was lucky were able to make 65% of 100cm putts, as opposed to 48% of controls.
Calin-Jageman and Caldwell reproduced this experiment. They found that the superstition-primed group was only 2% more accurate, which was not significant.
In a second experiment, they tried to make the "lucky" group feel even luckier by having a ball with a shamrock on it and saying "wow! you get to use the lucky ball". Again, there was no impact.
Summary: ✗ Clear failure
Moral behavior and brightness
Banerjee et al. (2012) found that recalling unethical behavior caused people to see the room as darker.
Brandt et al. replicated this. Participants were first asked to describe something they did in the past that was either ethical or unethical. In a first study, they were then asked about how bright their room was. In a second study, they were instead asked how desirable lamps, candles, and flashlights were.
They found nothing. Recalling ethical vs. unethical behavior had no effect on the estimated brightness of the room, or how much people wanted light-emitting products.
Summary: ✗ Clear failure