Teaching is about to start again for universities in New Zealand. Classes at Waikato start on 2 March. So it seems like an opportune time to talk about how we measure teaching quality, and in particular the ways that measurement is going wrong. The standard approach to measuring teaching quality is to ask students to complete an evaluation, often at the end of a paper or course. Those student evaluations of teaching (SETs) usually involve rating the teacher, and the paper, on some scale, and across one or more criteria. The scores are then combined (there are various ways to do this) to give an overall measure of teaching quality.
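To make that concrete, one common approach (a hypothetical sketch, not any particular university's formula) is a weighted average of the criterion-level ratings:

```latex
% A hypothetical SET aggregation: \bar{s}_k is the mean student rating
% on criterion k, and w_k is the weight the institution assigns to
% that criterion (in the simplest case, all weights are equal).
\[
\text{SET score} = \sum_{k=1}^{K} w_k \, \bar{s}_k,
\qquad \sum_{k=1}^{K} w_k = 1
\]
```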
There is some ideological support for asking students to rate teaching quality. If you view education as a consumption activity, then students are consumers, and the service provider (the university) wants to know about the experience of their customers. However, the theoretical support for this position is shaky. Education is not a consumption activity - it is a production activity. Education produces human capital, as well as a signal of quality to future employers. At the time that students complete a particular paper, they are in no position to evaluate the quality of that production, because they are not yet making use of it. It would be like asking a car buyer to rate the quality of spark plugs in their vehicle, before they've even had a chance to take it for a drive.
Students are not in a strong position to rate the quality of their education, perhaps until years after that education is complete. And I say this as someone who routinely gets outstanding teaching evaluations (and has multiple teaching awards from the last decade to add substance to that claim).
You may doubt me, but research on SETs backs me up. If students were good at evaluating teaching, then we wouldn't expect to see systematic biases appear in teaching evaluations. So, if SETs routinely rate female lecturers worse than male lecturers, you have the choice of arguing either that female lecturers really are worse teachers (on average) than male lecturers, or that SETs are biased. And if SETs are biased (which seems the more plausible claim), then that is evidence that SETs are not a good measure of teaching quality.
There's lots of evidence for gender bias in SETs. I've read several papers that attest to this, just in the last few years, and I'll outline some of them below.
Let's start with this 2016 article by Natascha Wagner, Matthias Rieger, and Katherine Voorvelt (all Erasmus University Rotterdam), published in the journal Economics of Education Review (ungated earlier version here). They use data from MA students enrolled at the International Institute of Social Studies at Erasmus University from 2010/11 to 2014/15, which included 688 teaching evaluations across 272 courses. Interestingly, the response rate to the teaching evaluations was 87%, much higher than many other institutions achieve. They find:
...significantly lower scores in teaching evaluations for women compared to men, but only once we control for course unobservables. In other words, the documented associations insinuate that teacher evaluations are not gender blind, and gender effects explain roughly one fourth of the sample standard deviation in SETs.
Female lecturers receive teaching evaluation scores that are 0.25 standard deviations lower than those of male lecturers, after controlling for the characteristics of different courses. They also find that:
Women obtain considerably lower teacher evaluations when teaching with men compared to teaching alone or with other women.
When students have the opportunity to compare male and female lecturers within the same course, they give (on average) better teaching evaluation scores to the male lecturer. Finally, this bit was also interesting:
Interestingly, we find that the negative female teacher effect is reversed in the major for gender studies and social justice.
In gender studies and social justice, male lecturers received worse teaching evaluations. That might have something to do with the gender composition of students in those majors, but without student-level data, we can't know for sure.
Moving on, this 2017 article by Anne Boring (Sciences Po), published in the Journal of Public Economics (ungated earlier version here), finds similar results, based on student-level teaching evaluation data for an unnamed university over the period from 2008/09 to 2013/14, which includes over 20,000 observations. That's right - Boring knows the individual evaluations that students gave (rather than just the average overall rating), so she can control for teacher effects as well as student effects (so if a student routinely gives high, or low, ratings, that can be accounted for). She has data for six mandatory courses, where students are unable to select their teacher (and therefore can't sort themselves into a section taught by a teacher of their preferred gender). She finds that:
...male students give significantly higher overall satisfaction scores to male professors than to female professors. Male students also rate male professors significantly higher than how female students rate both female and male professors... a male professor being rated by a male student is approximately 11 percentage points more likely to be rated as excellent compared to how he would be rated by a female student. As a result, a male professor’s expected excellent overall satisfaction score is approximately 20% higher than a female professor’s expected excellent overall satisfaction score. I also find that students perform equally well on final exams whether their professor was a man or a woman, suggesting no difference in actual teaching effectiveness. Thus, the results suggest that differences in teaching skills are not driving gender differences in evaluations.
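To see more clearly why the student-level data matters, here is a stylized two-way fixed-effects specification (my sketch, not necessarily Boring's exact model):

```latex
% y_{ij} is student i's rating of professor j. The student fixed
% effect \alpha_i absorbs whether student i is a generous or harsh
% rater, and the professor fixed effect \gamma_j absorbs professor j's
% underlying quality. \beta is then identified from the gender match
% between student and professor.
\[
y_{ij} = \alpha_i + \gamma_j
  + \beta \,(\text{Male}_i \times \text{Female}_j) + \varepsilon_{ij}
\]
```

Because students in these mandatory courses cannot choose their teacher, the gender match is as good as random, and the estimated coefficient can be read as a measure of bias rather than of teaching quality.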
Unlike male teachers, female teachers tend to receive similar scores from both male and female students. Notice also that teaching effectiveness (as measured by exam performance) doesn't depend on the gender of the teacher (a point that Alex Tabarrok made a couple of times last year on the Marginal Revolution blog, see here and here). Digging a little deeper, Boring finds that:
...male and female students tend to give more favorable ratings to male professors on teaching dimensions that are associated with male stereotypes (of authoritativeness and knowledgeability), such as class leadership skills and the professor’s ability to contribute to students’ intellectual development. I find that, on average, students rate female professors similarly to male professors for teaching skills that are more closely associated with female stereotypes (of being warm and nurturing), such as preparation and organization of classes, quality of instructional materials, clarity of the assessment criteria, usefulness of feedback on assignments, and ability to encourage group work.
Gender stereotypes seem to matter.
This 2019 article by Whitney Buser (Young Harris College), Jill Hayter (East Tennessee State University), and Emily Marshall (Dickinson College), published in the American Economic Review Papers and Proceedings (open access), uses student-level data from several unnamed universities, based on surveys conducted at three points during the semester. It's not entirely clear when the first survey was (perhaps on the second day of class?), but the other two surveys were collected on the day that the students' first exam was returned, and on the day of the final exam. Buser et al. have over 2200 survey responses in their sample, and they find that:
...statistically significant lower ratings of female professors at the beginning of the semester and after the first exam is returned. While ratings of male instructors also improve over the semester, female instructors have significantly lower ratings at the beginning of the semester and after the first exam grade is returned before eventually converging close to the ratings of male instructors.
So, this at least suggests that students' biases might lessen with greater exposure to female lecturers. However, that doesn't explain the persistent end-of-course bias found in the other studies.
A very similar study to Anne Boring's was reported in this 2018 article by Friederike Mengel (University of Essex), Jan Sauermann (Stockholm University), and Ulf Zölitz (University of Zurich), published in the Journal of the European Economic Association (ungated earlier version here). They use nearly 20,000 observations of student-level evaluation data from Maastricht University over the period 2009/10 to 2012/13, again in a setting where students are randomly assigned to a section and a teacher. Their sample includes evaluations for some 735 lecturers. They find that:
...female faculty receive systematically lower teaching evaluations than their male colleagues despite the fact that neither students’ current or future grades nor their study hours are affected by the gender of the instructor. The lower teaching evaluations of female faculty stem mostly from male students, who evaluate their female instructors 21% of a standard deviation worse than their male instructors. Female students were found to rate female instructors about 8% of a standard deviation lower than male instructors.
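To put those standard-deviation effects in more familiar terms, suppose (hypothetically) that evaluations sit on a five-point scale with a standard deviation of about 0.8 points. Then the gaps would be:

```latex
% Hypothetical conversion, assuming SD(ratings) = 0.8 points:
\[
0.21 \times 0.8 \approx 0.17 \text{ points (male students)},
\qquad
0.08 \times 0.8 \approx 0.06 \text{ points (female students)}
\]
```

Small in absolute terms, but potentially enough to matter when scores cluster near the top of the scale and are compared across lecturers.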
Notice that the size of the bias is strikingly similar to that reported in Wagner et al. Mengel et al. also find two other interesting results:
When testing whether results differ by seniority, we find the effects to be driven by junior instructors, particularly Ph.D. students, who receive 28% of a standard deviation lower teaching evaluations than their male colleagues. Interestingly, we do not observe this gender bias for more senior female instructors like lecturers or professors. We do find, however, that the gender bias is substantially larger for courses with math-related content...
The gender bias against women is not only present in evaluation questions relating to the individual instructor, but also when students are asked to evaluate learning materials, such as text books, research articles, and the online learning platform. Strikingly, despite the fact that learning materials are identical for all students within a course and are independent of the gender of the section instructor, male students evaluate these worse when their instructor is female.
If you still haven't bought into the conclusion that SETs are seriously biased, the second result (gender bias spills over into how textbooks are evaluated, even when students have the same textbook regardless of the gender of their teacher) should probably give you pause.
Finally, you might wonder whether these results are somehow unique to universities in high-income countries. It turns out that isn't the case, as this 2019 article by Carolyn Chisadza, Nicky Nicholls, and Eleni Yitbarek (all University of Pretoria), published in the journal Economics Letters (sorry, I don't see an ungated version of this one online), shows. Chisadza et al. asked 1599 first-year economics students to watch a 12-minute video, and then complete a quiz and a SET. Students were randomised as to the gender and race of the presenter in the video, but otherwise the script and the slides were the same. They found that:
...students give higher ratings to female and white lecturers. These differences are most pronounced for female and white students.
It's interesting that they find an effect in the opposite direction to the other studies I highlighted earlier in the post. However, this study isn't quite as convincing as those others, because it's based on ratings of a single short video in a single course, rather than evaluations of actual teaching over a full semester. It does at least show that biases in SETs are probably not limited to universities in high-income countries (though it would be interesting to see more studies of bias in SETs using data from universities in developing and middle-income countries).
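The beauty of this design is that, because the presenter's gender and race are randomly assigned and everything else is held constant, a simple comparison of mean ratings (or an OLS regression) identifies the causal effect. Here is a minimal simulation sketch in Python, with made-up effect sizes chosen only to mimic the direction of their findings:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical illustration of a Chisadza et al.-style design: students
# are randomly assigned a presenter gender and race, then rate an
# otherwise identical lecture.
rng = np.random.default_rng(42)
n = 1599  # sample size reported in the paper

df = pd.DataFrame({
    "female_presenter": rng.integers(0, 2, n),
    "white_presenter": rng.integers(0, 2, n),
})

# Simulated ratings on a 5-point scale; the coefficients below are
# made up, chosen only to mimic the direction of the reported effects.
df["rating"] = (3.0
                + 0.15 * df["female_presenter"]
                + 0.15 * df["white_presenter"]
                + rng.normal(0, 0.8, n)).clip(1, 5)

# With random assignment, OLS on the presenter's attributes recovers
# the causal effect of each attribute on the rating.
model = smf.ols("rating ~ female_presenter + white_presenter", data=df).fit()
print(model.params)
```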
All up, I think it is fairly safe to conclude that SETs are systematically biased, and those biases probably arise from stereotypes. The biases are also seriously consequential for faculty. Teaching evaluations are used in hiring and promotion decisions, and if they are systematically biased against particular groups, then those groups will be disadvantaged in their careers.
We need to re-think the measurement of teaching quality. Students are not consumers, and so we can't evaluate teaching the same way we would evaluate a transaction at a fast food restaurant, by surveying the 'customers'. There are alternatives to SETs that universities should make more use of, including teaching portfolios (where teachers have an opportunity to articulate their teaching approach and support it with evidence), and peer evaluations (which are used much more extensively at primary and secondary schools, for instance). Of course, these alternatives are neither as simple, nor as low-cost, as SETs. However, if you want an evaluation done right, sometimes you have to pay the full cost of conducting that evaluation.