December 15, 2017

Stereotype Threat

The Idea

In 1995, Claude Steele and Joshua Aronson published a paper entitled “Stereotype Threat and the Intellectual Test Performance of African Americans”.

In it, they gave university students a 30-minute test. In the control condition, the test was described as a laboratory problem-solving test. And on that, the black students and the white students performed exactly as their SAT scores would predict.

But then in the “stereotype threat” condition, the test was described as a diagnostic of intellectual ability. On that version, the black students scored significantly lower than their SAT scores would predict.

This effect has been replicated probably hundreds of times, and at the time was the source of great hope. And this is the essence of stereotype threat: take students, and in one condition have “no stereotype threat”, and in the other have “stereotype threat”. In the stereotype threat condition, minorities (women, blacks, old people, etc.) will somehow be reminded of a stereotype about how their group performs on the task at hand, this will make them anxious, and the anxiety will make their performance suffer. Or so the theory goes.

And the “stereotype threat” condition can have multiple degrees, from saying that it’s a reasoning test, to an intelligence test, to using gendered language, to using very white language, to straight up saying that whites score better on this intelligence test than blacks do.

So the next obvious question to ask is, “does this effect exist in the real world, or is it confined to the laboratory?”

Natural Experiments

Since it would be unethical to create a stereotype threat effect in a real-world, real-stakes test, the only “field data” on stereotype threat comes from natural experiments in which race and gender “primes” happened to be on some of the tests, but not others. This occurred on the NAEP from 1978 to 1999, and on the AB Calculus test and the community college Computerized Placement Tests from 1996.

NAEP – In a 2009 paper entitled “Stereotype Threat, Gender, and Math Performance: Evidence from the National Assessment of Educational Progress”, Thomas Wei analyzed NAEP tests administered from 1978 to 1999. During these years there was a design quirk: some tests asked the students their gender, and some did not. On the tests that did, the questions took this form:

“How do you feel about this statement: math is more for boys than girls. Do you strongly disagree, disagree, undecided, agree, or strongly agree?”

The other two statements the test-takers were asked to rate were “fewer men have logical ability than women” and “math is more for girls than for boys”.

In this data he found no evidence of stereotype threat effects in any year the NAEP was administered.

Two more natural experiments existed, this time with race primes: the AB Calculus test and the Computerized Placement Tests for community college course placement. The “experimental” condition gives the scores when some sort of prime was invoked, and the control condition gives the scores when there was none.

AB Calculus

Group    Experimental    Control
White    19.21           18.92
Black    15.69           14.59
Asian    20.4            20.35

Computerized Placement Tests

Group    Experimental    Control
White    72.16           72.31
Black    56.58           55.8

And so in all 3 major natural experiments for stereotype threat, there was no significant effect of primes on any group.
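As a quick check on the two tables, here is a minimal Python sketch that subtracts each control mean from the matching experimental mean. The numbers are copied from the tables above; the names `ab_calculus`, `placement` and `prime_gaps` are just illustrative. It only computes raw differences, since the tables give no sample sizes or standard deviations to test significance with.

```python
# Mean scores copied from the two tables above, as (experimental, control).
ab_calculus = {"White": (19.21, 18.92), "Black": (15.69, 14.59), "Asian": (20.4, 20.35)}
placement = {"White": (72.16, 72.31), "Black": (56.58, 55.8)}


def prime_gaps(scores):
    """Return experimental minus control for each group, rounded to 2 places."""
    return {group: round(exp - ctl, 2) for group, (exp, ctl) in scores.items()}


ab_gaps = prime_gaps(ab_calculus)
cpt_gaps = prime_gaps(placement)
print(ab_gaps)   # {'White': 0.29, 'Black': 1.1, 'Asian': 0.05}
print(cpt_gaps)  # {'White': -0.15, 'Black': 0.78}
```

A negative gap would be the direction stereotype threat predicts for the primed group. In both tables the gap for black test-takers is positive.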

Effect of Payment (Gender)

Roland Fryer and Steven Levitt conducted an experiment in which they paid men and women money for getting correct answers. In it, the gender prime was:

“This is a diagnostic test of your mathematical ability. As you may know, there have been some academic findings about gender differences in math ability. The test you are going to take today is one where men have typically outperformed women.”

What they found was that they could not elicit a stereotype threat effect when they paid the subjects for correct answers.

Later, in an interview John List, who worked on the experiment with Fryer and Levitt, said:

List: I believe in priming. Psychologists have shown us the power of priming, and stereotype threat is an interesting type of priming. Claude Steele, a psychologist at Stanford, popularized the term stereotype threat. He had people taking a math exam, for example, jot down whether they were male or female on top of their exams, and he found that when you wrote down that you were female, you performed less well than if you did not write down that you were female. They call this the stereotype threat. My first instinct was that effect probably does happen, but you could use incentives to make it go away. And what I mean by that is, if the test is important enough or if you overlaid monetary incentives on that test, then the stereotype threat would largely disappear, or become economically irrelevant. So we designed the experiment to test that, and we found that we could not even induce stereotype threat. We did everything we could to try to get it. We announced to them, “Women do not perform as well as men on this test and we want you now to put your gender on the top of the test.” And other social scientists would say, that’s crazy — if you do that, you will get stereotype threat every time. But we still didn’t get it. What that led me to believe is that, while I think that priming works, I think that stereotype threat has a lot of important boundaries that severely limit its generalizability. I think what has happened is, a few people found this result early on and now there’s publication bias. But when you talk behind the scenes to people in the profession, they have a hard time finding it. So what do they do in that case? A lot of people just shelve that experiment; they say it must be wrong because there are 10 papers in the literature that find it. Well, if there have been 200 studies that try to find it, 10 should find it, right? This is a Type II error but people still believe in the theory of stereotype threat. 
I think that there are a lot of reasons why it does not occur. So while I believe in priming, I am not convinced that stereotype threat is important.
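List's "200 studies, 10 should find it" arithmetic can be sketched in a few lines of Python. This assumes the conventional 5% significance threshold, which List does not state but which his numbers imply, and it assumes the studies are independent and the true effect is zero. The function names are illustrative.

```python
from math import comb


def expected_false_positives(n_studies, alpha=0.05):
    """Expected count of 'significant' results when the true effect is zero."""
    return n_studies * alpha


def prob_at_least(k, n_studies, alpha=0.05):
    """Chance of k or more false positives among n independent null studies."""
    return sum(
        comb(n_studies, i) * alpha**i * (1 - alpha) ** (n_studies - i)
        for i in range(k, n_studies + 1)
    )


print(round(expected_false_positives(200), 2))  # 10.0
print(prob_at_least(10, 200))  # binomial chance of seeing 10 or more such hits
```

Under those assumptions, ten positive papers out of 200 attempts is what chance alone would be expected to produce.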


A 2012 meta-analysis by Gijsbert Stoet and David Geary entitled “Can Stereotype Threat Explain the Gender Gap in Mathematics Performance and Achievement?” looked at the attempts made up to that time to replicate a highly influential 1999 paper on gender differences in math scores. It covered two kinds of replication attempts: ones which took men and women with different prior scores and statistically adjusted for the difference, and ones which compared men and women whose prior scores really were equal.

Percent replicated using real equal scores 55.00%
Percent replicated using adjusted equal scores 30.00%

In the paper “Under What Conditions? Stereotype Threat and Prime Attributes”, Thomas Wei looked at 64 studies on race and gender and stereotype threat. The percent of studies which showed an effect (weighted by sample size) was only 58.4%, meaning that in the laboratory experiments a stereotype threat effect was produced only 58.4% of the time.

At the end of their meta-analysis, Stoet and Geary comment on how stereotype threat is presented as an explanation for male-female math differences:

“We list just a few more examples of typical overconfident statements. Good, Aronson, and Harder (2008) state “It is well established that negative stereotypes can undermine women’s performance on mathematics tests.” Similarly, Eriksson and Lindholm (2007) note “It is well established that an emphasis on gender differences may have a negative effect on women’s math performance in U.S.A., Germany and the Netherlands.” Davies et al. (2002) wrote “Women in quantitative fields risk being personally reduced to negative stereotypes that allege a gender-based math inability. This situational predicament, termed stereotype threat, can undermine women’s performance and aspirations in all quantitative domains.” In summary, there is a mismatch between the strength of the stereotype threat explanation of the gender difference in some areas of mathematics and the way many researchers describe it in the abstracts of their scientific publications. This might well have contributed to the common misrepresentation of stereotype threat hypothesis in the popular media. We think there are two possible reasons for the misrepresentation of the strength and robustness of the effect. On the one hand, we assume that there has simply been a cascading effect of researchers citing each other, rather than themselves critically reviewing evidence for the stereotype threat hypothesis. For example, if one influential author describes an effect as robust and stable, others might simply accept that as fact.”

Publication Bias

Another issue with stereotype threat is the possibility of publication bias. Publication bias is when researchers are more likely to publish certain results on a given topic. And usually they are biased to publish positive results. For stereotype threat, they would be biased in favor of showing a stereotype threat effect. The most straightforward way of detecting publication bias is by looking at published and unpublished studies.

The obvious problem is that unpublished studies aren’t published, so anyone trying to find publication bias must track down unpublished manuscripts and compare them to the published data.

Two attempts that I know of were made by Ganley et al. in 2013 and Wicherts in 2015. Combined, they found 9 unpublished manuscripts, of which 2 showed a stereotype threat effect and 7 showed no effect:

Analysis         Number with ST effect    Number w/o ST effect
Wicherts         2                        3
Ganley et al.    0                        4

Another sign of publication bias in stereotype threat is that too few published studies show zero effect. Studies of the same phenomenon vary in the effect sizes they report, and those effect sizes should be roughly normally distributed around the mean effect size. Moreover, how widely they should spread around that mean can be estimated from how much individuals within each study vary in their reaction to the experiment. Thus, if there is no publication bias, we can predict that a certain number of zero or negative effect sizes should occur in a literature, given the mean effect size and the amount of individual variation within studies.

If we saw a ton of very weak effect sizes and lots of individual variation within each study, but few or no studies reporting no effect, that would suggest that such studies weren’t being published, and would therefore be evidence of publication bias.

This is exactly what was found by Jelte Wicherts when he examined the literature on stereotype threat. Basically, there are too many “hits” for the effect size and variation within each study, which is a sign of publication bias.
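The reasoning above can be sketched in Python. This is a simplified illustration, not Wicherts's actual procedure. It assumes reported effects are normally distributed around the mean effect with a spread equal to a typical study's standard error, and the input numbers are made up for the example.

```python
from statistics import NormalDist


def share_at_or_below_zero(mean_effect, std_error):
    """Share of studies expected to report an effect of zero or less.

    Assumes observed effects are normally distributed around mean_effect
    with standard deviation std_error (a typical study's sampling error).
    """
    return NormalDist(mu=mean_effect, sigma=std_error).cdf(0.0)


# Hypothetical inputs, not taken from Wicherts or any real study.
mean_effect = 0.2   # average standardized effect across the literature
std_error = 0.3     # typical sampling error of a single small study
n_published = 50    # number of published studies

share = share_at_or_below_zero(mean_effect, std_error)
expected_nulls = n_published * share
print(f"{share:.2f} of studies, about {expected_nulls:.0f} of {n_published}")
```

With these made-up inputs, roughly a quarter of the studies should come out at or below zero. A published literature containing far fewer than that would be the "too many hits" pattern described above.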

If we include the unpublished studies we know of in Wei’s meta-analysis, the percent of n-weighted studies showing a stereotype threat effect is only 53.9%. If there exist 9 more unpublished studies out there with the same kind of results as the ones we know about, then the “hit rate” for stereotype threat would fall to roughly 50%.
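The arithmetic behind these figures can be reproduced with a short Python sketch. It makes one simplifying assumption: Wei's n-weighted 58.4% is treated as if it were a plain share of 64 equally weighted studies, because the individual sample sizes are not given here. The function name `pooled_hit_rate` is illustrative.

```python
def pooled_hit_rate(base_studies, base_rate, extra_studies, extra_hits):
    """Hit rate after adding extra studies to a base literature."""
    hits = base_studies * base_rate + extra_hits
    return hits / (base_studies + extra_studies)


# 64 studies at 58.4%, plus the 9 known unpublished manuscripts (2 hits).
with_known = pooled_hit_rate(64, 0.584, 9, 2)
# The same, plus 9 more hypothetical manuscripts with the same 2-in-9 pattern.
with_twice = pooled_hit_rate(64, 0.584, 18, 4)

print(f"{with_known:.1%}")  # 53.9%
print(f"{with_twice:.1%}")  # 50.5%
```

On that reading the first figure matches the 53.9% above, and a second batch of nine similar manuscripts lands at about 50.5%, right around the 50% mark.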

Predictive Validity and Prior Baselines

The real kicker on stereotype threat, and why it can’t explain group differences, is that it is a NEGATIVE effect relative to a prior baseline.

These experiments always take selected demographic groups, matched for SAT scores or some other test scores, and then use some sort of “prime” to push the black or Hispanic scores down from what you would expect based on their real-world scores. And they always run a control, which has no race or gender prime, and which almost always shows scores in line with each group’s prior SAT scores.

What this means is that the control group (no stereotype threat) in these laboratory experiments is precisely the group that most closely matches the real-world scores! The control group in Steele and Aronson scored at precisely what their prior SAT scores would predict!

But beyond that, analysis of SAT scores and college performance shows that racial groups perform roughly the same in college for the same SAT score. In fact, SAT scores slightly overpredict black and Hispanic college performance relative to whites. So if these groups face stereotype threat on the SAT, do they also face the same relative amount of stereotype threat on average on all academic tasks in their college career?

The same thing can be said about ASVAB test scores by race and performance in the military, and about IQ tests and high school grades. They all predict roughly the same for all racial groups, if not slightly overpredicting for blacks and Hispanics.

Now the one knock on this is that females perform better than males in college for any given SAT score. Does that mean female scores are depressed by stereotype threat? Well, all the other evidence says not, and there are some plausible explanations for why females would get better grades for any given SAT score that could be explored: more diligence, better organization, more conformity, less interest in videogames, etc.

The only way I can see to square stereotype threat with the predictive validity of these tests, and with the fact that real-world scores are the same as the control-condition scores in these lab experiments, would be to claim that stereotype threat is so pervasive that it exists even in the “control” condition and in the real world, but that it can never be measured because it’s always there and there isn’t a “no stereotype threat” condition anywhere in reality.

Such a view, however, is completely untestable and thus completely unscientific.
