Problems in the pipeline: Stereotype threat and women's achievement in high-level math courses

Social forces, such as stereotype threat, can cause women to underperform men in math examinations. This achievement gap can be closed or even reversed when strategies are implemented during testing that eliminate this threat, such as including statements at the beginning of an exam that indicate both genders tend to perform equally well on it.


There is growing evidence that negative stereotypes of one’s social group can hinder performance in academic, professional, and other settings. This phenomenon, known as “stereotype threat”, has been studied as one reason that women underperform men on math tests and are later underrepresented in STEM fields. Much of the existing experimental research investigating this gender achievement gap is based on interventions that are not easily applicable outside of the laboratory setting. For example, interventions such as having students watch videos before math testing or take verbal exams are often difficult to put into practice in classroom settings.

In order to understand the impact of stereotype threat on women’s math performance, the researchers in this experimental study focus on high-level mathematics students who appear to be on a career pipeline to mathematics and science professions. They use a natural classroom setting, an advanced college calculus course, and a typical event, a practice test for an upcoming math exam, to evaluate an intervention aimed at eliminating the effect of stereotype threat on women’s math performance. Specifically, the researchers compare the performance of men and women who received a regular diagnostic test with that of men and women who received a test that opened with a statement designed to eliminate stereotype threat.


Students from an undergraduate calculus course performed worse on a math test under a gender stereotype “threat” condition—here, a typical university-administered academic test—than when under a “non-threat” condition, where the same test includes a preamble acknowledging equal performance between genders on the test in the past. These results were amplified among white women, where those in the non-threat condition significantly outperformed those in the traditional test-taking environment.

  • Among white participants, women in the non-threat condition significantly outperformed women in the stereotype threat condition (mean scores of 4.44 and 2.85 out of 15 items), and men in either condition (mean scores of 2.7 and 2.89 out of 15 items)
  • Overall, participants in a non-threat condition performed more accurately (correct answers measured out of completed answers) than participants in a stereotype threat condition
  • Women in the non-threat condition performed significantly more accurately than women in a stereotype threat condition as well as men in the stereotype threat condition
  • Participants in the non-threat condition reported higher confidence than those in the stereotype threat condition
  • Male participants reported higher confidence than female participants in either condition
  • Women’s course grades significantly under-predicted their performance on the practice test in the non-threat condition (by 1.28 points)
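The accuracy measure in the second bullet (correct answers out of completed answers) is distinct from the raw score out of 15 items. A minimal sketch in Python of how the two could be computed; the function name is hypothetical and the numbers are illustrative only, not data from the study:

```python
TOTAL_ITEMS = 15  # number of test items, as stated in the summary


def accuracy(correct: int, attempted: int) -> float:
    """Share of attempted items answered correctly (0.0 if none attempted)."""
    if correct < 0 or attempted < 0 or correct > attempted:
        raise ValueError("need 0 <= correct <= attempted")
    return correct / attempted if attempted else 0.0


# Illustrative values only: 4 correct answers among 8 completed items.
raw_score = 4 / TOTAL_ITEMS  # proportion of all 15 items answered correctly
acc = accuracy(4, 8)         # proportion of completed items answered correctly
```

Under this reading, a participant who attempts few items but answers them correctly has high accuracy despite a low raw score, which is why the two measures are reported separately.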

Women in the non-threat condition outperformed both women and men in the threat condition, indicating that stereotype threat hindered women’s performance and that course grades may understate women’s math ability. These findings are more pronounced for white women, who outperformed all men under the non-threat condition. The authors suggest that a typical diagnostic test induces stereotype threat, that a simple preamble nullifies it, and that educators looking to help close this gender gap should consider such a preamble as a strategy.


In order to narrow findings to those within the mathematics and sciences career pipeline, participants were selected from an advanced college calculus course required for natural sciences, engineering and mathematics degrees. Of the 174 students who participated, only the 157 who reported their sex (100 male and 57 female) were included in the analysis. Participants were randomly assigned to one of two test conditions, stereotype threat or non-threat; crossed with participant sex, this yields a 2 by 2 factorial design.
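The design can be illustrated with a short Python sketch. The summary does not say how the randomization was carried out or whether it was balanced, so the near-balanced shuffle below, the function name, and the seed are all assumptions made for illustration:

```python
import random
from collections import Counter

CONDITIONS = ("threat", "non-threat")


def assign_conditions(n, seed=None):
    """Return a shuffled, near-balanced list of condition labels for n people.

    Balanced assignment is an assumption; the summary only says "random".
    """
    rng = random.Random(seed)
    labels = [CONDITIONS[i % 2] for i in range(n)]
    rng.shuffle(labels)
    return labels


# Sex counts as reported in the summary: 100 men and 57 women.
sexes = ["male"] * 100 + ["female"] * 57
labels = assign_conditions(len(sexes), seed=0)

# Tally the cells of the 2 by 2 design (sex crossed with condition).
cells = Counter(zip(sexes, labels))
```

Only the test condition is randomized; sex is an observed attribute, so the cell sizes within each sex depend on the shuffle.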

Participants were presented the test as an optional practice test for extra credit by the professor of their calculus course. The test, which was pilot tested with 12 advanced calculus students from a similar course, was designed based on the Graduate Record Exam (GRE) and covered the same content as the participants’ advanced calculus course. On the day of the practice test, the teaching assistant introduced the researcher, who presented the test as a diagnostic exercise covering content from their course that would count as extra credit. Test packets were handed out to the students, with the experimental conditions randomly assigned inside each packet. The stereotype threat condition was characterized by a test including an opening statement that outlined the test as one designed to measure their math abilities. This condition was designed to emulate a typical test. The non-threat condition was designed to nullify the stereotype threat. It consisted of an identical opening statement with an additional paragraph stating that the test had not revealed any differences in performance based on gender, and that men and women performed “equally well” on the practice test. Before beginning the test, the researcher asked the participants to review an example question and rate their confidence in answering the question on a 5-point scale.

Participants were given 20 minutes to complete the test. This short time frame was designed to put pressure on working memory, which has been shown to be further impaired in stereotype-threatening contexts. After completing the test, the participants completed a demographic survey as well as a manipulation check, rating statements such as “This test was biased” on a 15-point scale. Additionally, the researchers collected the participating students’ overall course grades at the end of the semester.
