Scaling Down Inequality: Rating Scales, Gender Bias, and the Architecture of Evaluation

In male-dominated fields, quantitative performance ratings for judging a professor’s merit elicit more gender bias when ratings are assessed on a 10-point scale than when assessed on a 6-point scale.

Introduction

Performance ratings are important tools in modern organizations for judging merit and are widely used in businesses and universities. They also heavily influence promotions and career advancement for employees. However, evidence suggests that such tools may elicit bias against women. On average, people rate male workers as significantly more capable, likeable, and worthy than female workers, even when they exhibit identical qualifications, performance, and behaviors. This gender gap in how women are assessed relative to men is a major factor contributing to gender gaps in hiring, promotion, and pay.

These disparities also exist in academia, where studies have shown that women consistently receive lower performance ratings, are perceived as less competent, and are given less credit for their performance than men, even when they exhibit the same behaviors and skill levels. As a result of these disparities in professional evaluations, women in academia are much more likely to receive adjunct or non-tenure-line positions, with lower pay, prestige, and job security.

Despite ample evidence that evaluations can substantially contribute to gender inequalities in business and academia, there is little research on how to revise the architecture of evaluations (that is, the design of the tools used for evaluation) in order to curb gender inequity. In this study, researchers investigated how one aspect of the evaluation architecture—whether ratings are assessed on a 10-point scale or a 6-point scale—can affect the size of the gender gap in course evaluation outcomes in a university setting.

Findings

In male-dominated fields, quantitative performance ratings for judging merit elicit more gender bias when ratings are assessed on a 10-point scale than when assessed on a 6-point scale.

  • In fields that are closer to gender parity (which consist of 29 to 39% women), there appears to be no gender gap in the evaluation of women relative to men, on either a 10-point scale or a 6-point scale.
  • In highly male-dominated fields (which consist of 11 to 15% women), men are 1.6 times more likely than women to receive a perfect score on a 10-point scale (a 10/10 rating). In these fields, a 10/10 rating was the most common rating given to men, whereas an 8/10 rating was the most common rating given to women.
    • This gender gap is likely due to a perfect 10 rating being associated with stereotypes of exceptional, brilliant, or excellent performance, traits more commonly attributed to men than to women. Men are more than twice as likely as women to be described as “exceptional,” “phenomenal,” “fantastic,” “perfect,” and “awesome.”
  • In highly male-dominated fields, when the evaluation system was switched to a 6-point scale, the gender gap in the evaluation of men and women disappeared.
    • Changing to a 6-point scale for evaluations may be a successful intervention because 6-point rating scales are less likely to elicit gender stereotypes of exceptional performance than 10-point rating scales.
    • People who gave someone a 10/10 rating described that person with a superlative descriptor (“exceptional,” “phenomenal,” etc.) 54.2% of the time, whereas people giving someone a 6/6 rating used such descriptors only 28.6% of the time.
  • Although shifting to a 6-point scale eliminated previously wide gender gaps in performance ratings, the survey experiment’s results also showed that it did not eliminate underlying gender bias: even with the structural change to the evaluations, differences in the underlying qualitative perceptions of male and female instructors persisted.
    • Given identical lecture transcripts that randomly varied instructor gender, 15.5% of participants strongly agreed that the professor was brilliant when they believed the instructor to be male; only 9.5% expressed the same view when they believed the instructor to be female.

These findings highlight how seemingly minor aspects of evaluation design, and the cultural meanings attached to specific numbers, can affect gender disparities. Revising the design and numbers used in an organization’s rating systems can help curb gender inequities in the workplace.

Methodology

In this study, researchers conducted a quasi-natural experiment and a survey experiment to assess the impact of a shift from a 10-point to a 6-point assessment scale on gender bias in merit evaluations. In the quasi-natural experiment, the researchers used existing course evaluations at a large North American university, where a 10-point assessment scale was used for twenty cycles, and then a 6-point assessment scale was used for nine cycles. The data comprised 105,034 student ratings of 369 instructors across 235 courses. The courses fell into one of eight subject areas. Overall, 24.4% of the instructors were female. However, half of the subject areas were highly male-dominated (consisting of 11 to 15% female instructors), whereas the other half were closer to gender parity (consisting of 29 to 39% female instructors). The researchers calculated the impact of the change from the 10-point scale to the 6-point scale on the size of the gender gap in how students evaluated the instructors. One of the major benefits of this sample is that it allowed researchers to compare the course evaluations given to the same teacher teaching the same course across the 10-point to 6-point change, helping them rule out confounding variables in the data.

In the survey experiment, researchers presented an identical lecture transcript to respondents, modifying only the name of the instructor (which implicitly indicated the instructor’s gender) and whether the respondent was asked to assess the instructor on a 10-point or 6-point scale. Participants were randomly assigned one condition for each of these two factors (instructor gender and number of points on the assessment scale). The 400 participants for the survey experiment were drawn from Survey Monkey’s Audience and Survey Sampling International pools. To narrow the pool of participants, participants were selected only if they met the following criteria: they were degree-seeking students enrolled in on-campus programs, and their institution was among the top-100 degree programs in the United States, as ranked by the U.S. News and World Report in 2018. The lecture transcript the participants read was modified from a TED talk about the social and economic implications of technological change. This topic was selected because technology and economics are traditionally male-dominated fields.
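The survey experiment's design can be thought of as a 2×2 between-subjects randomization: each participant is independently assigned one instructor-gender condition and one rating-scale condition. The sketch below is purely illustrative (the function and variable names are hypothetical, not from the study), assuming simple random assignment of the 400 participants:

```python
import random

# Hypothetical sketch of a 2x2 between-subjects random assignment,
# mirroring the survey experiment's two factors.
GENDER_CONDITIONS = ["male", "female"]   # implied by instructor name
SCALE_CONDITIONS = [10, 6]               # points on the rating scale

def assign_conditions(n_participants, seed=42):
    """Randomly assign each participant one level of each factor."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [
        {
            "instructor_gender": rng.choice(GENDER_CONDITIONS),
            "scale_points": rng.choice(SCALE_CONDITIONS),
        }
        for _ in range(n_participants)
    ]

conditions = assign_conditions(400)
```

Because assignment is random and independent across the two factors, any systematic difference in ratings between conditions can be attributed to the manipulated factors rather than to instructor or participant characteristics.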

After the participants assessed the instructor based on the lecture transcript, on either a 10-point or 6-point scale, they were asked to write down the words they associated with the instructor’s teaching performance. Researchers used these responses to understand why respondents rated the lecturers as they did. Future research could expand the scope of this study by examining how other elements of evaluation architecture affect inequalities in workplace evaluations, and how evaluation architecture interacts with race, sexual orientation, disability status, and parenthood.

Related GAP Studies