Monday, 11 May 2020

Are student evaluations of teaching even measuring teaching quality?

A few months ago, I wrote a post about gender biases in student evaluations of teaching, concluding:
All up, I think it is fairly safe to conclude that SETs are systematically biased, and those biases probably arise from stereotypes...
We need to re-think the measurement of teaching quality. Students are not consumers, and so we can't evaluate teaching the same way we would evaluate a transaction at a fast food restaurant, by surveying the 'customers'.
You might have taken away from that post that, within a gender or ethnic group, the ranking of teachers might still be a suitable measure of teaching quality, even if comparisons between those groups are biased. After all, there are decades of research, and meta-analyses (such as this heavily cited one by Peter Cohen), suggesting that SET ratings are positively correlated with student learning.

However, I just finished reading this 2017 article by Bob Uttl (Mount Royal University), Carmela White (University of British Columbia), and Daniela Wong Gonzalez (University of Windsor), published in the journal Studies in Educational Evaluation (it appears to be open access, but just in case, there is an ungated version here). It calls into question all of the previous meta-analyses (including, and especially, Cohen's), all of which are based on studies with multisection designs. Uttl et al. explain that:
An ideal multisection study design includes the following features: a course has many equivalent sections following the same outline and having the same assessments, students are randomly assigned to sections, each section is taught by a different instructor, all instructors are evaluated using SETs at the same time and before a final exam, and student learning is assessed using the same final exam. If students learn more from more highly rated professors, sections' average SET ratings and sections' average final exam scores should be positively correlated.
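To make that design concrete, here is a minimal sketch in Python of the statistic these studies compute: the correlation between section-average SET ratings and section-average final exam scores. All of the numbers are invented for illustration; nothing here comes from any actual study.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical course with 20 equivalent sections. Each section has a
# mean SET rating (1-5 scale) and a mean final exam score (percent).
# These values are randomly generated, purely for illustration.
n_sections = 20
set_means = rng.normal(loc=3.8, scale=0.4, size=n_sections)
exam_means = rng.normal(loc=70.0, scale=8.0, size=n_sections)

# The multisection design's key statistic: the section-level correlation
# between average SET rating and average exam performance. Note that the
# unit of observation is the section, not the individual student.
r = np.corrcoef(set_means, exam_means)[0, 1]
print(f"SET/learning correlation across sections: r = {r:+.2f}")
print(f"Share of variance in exam scores 'explained': r^2 = {r**2:.1%}")
```

If higher-rated instructors really produced more learning, r should be reliably positive across studies like this; the rest of the paper is about whether it is.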
Uttl et al. first replicate the meta-analyses conducted in several past papers, correcting for small study bias: the tendency for studies with only a small number of observations (in this case, a small number of sections) to report large effects, even when the 'true' effect is zero. In the case of re-analysing Cohen's data, they report that:
...Cohen’s (1981) conclusion that SET/learning correlations are substantial and that SET ratings explain 18–25% of variability in learning measures is not supported by our reanalyses of Cohen's own data. The re-analyses indicate that SET ratings explain at best 10% of variance in learning measures. The inflated SET/learning correlations reported by Cohen appear to be an artifact of small study effects, most likely arising from publication bias.
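Two things are worth unpacking there. First, the arithmetic: 'variance explained' is just the square of the correlation, so the 18-25% figure corresponds to correlations of roughly .43 to .50, while 'at best 10%' corresponds to a correlation of about .32. Second, the small study effects. Here is a rough simulation sketch of how they can manufacture a large average effect out of nothing (all numbers are invented, and the crude 'publication filter' is my own stand-in for publication bias, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def observed_r(n_sections, true_r=0.0):
    """Sample correlation from one hypothetical multisection study."""
    x = rng.normal(size=n_sections)
    y = true_r * x + np.sqrt(1.0 - true_r**2) * rng.normal(size=n_sections)
    return np.corrcoef(x, y)[0, 1]

# 2,000 hypothetical studies, mostly small (few sections), a few large.
# The true SET/learning correlation is set to exactly zero.
sizes = rng.choice([8, 15, 40, 100], size=2000, p=[0.50, 0.30, 0.15, 0.05])
results = np.array([observed_r(n) for n in sizes])

# Crude publication filter: only 'impressive' positive correlations
# make it into print.
published = results > 0.3

print(f"Mean r, all studies:            {results.mean():+.3f}")
print(f"Mean r, published studies:      {results[published].mean():+.3f}")
print(f"Mean sections, all studies:     {sizes.mean():.0f}")
print(f"Mean sections, published only:  {sizes[published].mean():.0f}")
```

Small studies scatter widely around zero, so they dominate the set of 'impressive' results; average only those, and you get a substantial pooled correlation from a true effect of zero. Correcting for this, as Uttl et al. do, pulls the estimate back down.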
They find similar results in their re-analyses of other past meta-analyses, and then go on to conduct their own thorough meta-analysis of all studies up to January 2016, covering 51 articles that reported the results of 97 studies. They find that:
...when the analyses include both multisection studies with and without prior learning/ability controls, the estimated SET/learning correlations are very weak with SET ratings accounting for up to 1% of variance in learning/achievement measures... when only those multisection studies that controlled for prior learning/achievement are included in the analyses, the SET/learning correlations are not significantly different from zero.
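The distinction about prior learning/ability controls matters because students are not always randomly assigned to sections. If abler students cluster in particular sections, and ability drives both exam scores and SET ratings, then SETs and learning will correlate even when teaching quality plays no role. A small sketch of that mechanism (again, every number is made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# 60 hypothetical sections whose average student ability differs, i.e.
# there is no random assignment. Ability drives BOTH the section's mean
# exam score and its mean SET rating; teaching quality plays no role.
n = 60
ability = rng.normal(size=n)   # e.g. section mean prior GPA
exam = 0.7 * ability + rng.normal(scale=0.7, size=n)
sets = 0.5 * ability + rng.normal(scale=0.9, size=n)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def residuals(y, x):
    """Residuals from a simple regression of y on x."""
    slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    return y - y.mean() - slope * (x - x.mean())

print(f"Raw SET/exam correlation:       {corr(sets, exam):+.2f}")
# Partial correlation: correlate what is left of each variable after
# removing the part explained by prior ability.
r_partial = corr(residuals(sets, ability), residuals(exam, ability))
print(f"Controlling for prior ability:  {r_partial:+.2f}")
```

The raw correlation looks respectable, but it is pure selection; once prior ability is partialled out, it collapses toward zero, which is the pattern Uttl et al. report for the better-controlled studies.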
In other words, across the studies with the best controls, there is no observed correlation between student learning (as measured by final grades, final exam marks, or similar) and student evaluations of teaching. Teachers who receive higher SET scores do not, on average, generate better student outcomes. Uttl et al. conclude that:
Despite more than 75 years of sustained effort, there is presently no evidence supporting the widespread belief that students learn more from professors who receive higher SET ratings.
That only strengthens my conclusion from the previous post: we need to re-think the measurement of teaching quality. At a time when our teaching methods have been thrown into chaos by the coronavirus lockdown, it seems like an opportune time to re-think evaluation and build something more reliable.
