Friday 20 January 2023

Grading bias, classroom behaviour, and assessing student knowledge

There is a large literature that documents teachers' biases in the grading of student assessments. For example, studies have compared subjectively graded (by teachers) assessments with objectively graded (or blind-graded) assessments to demonstrate gender bias and racial bias. However, grading bias may not arise only from demographics. Teachers may also show favouritism towards well-behaved students (relative to badly-behaved students). The challenge in demonstrating such bias is that researchers often lack detailed measures of student behaviour.

That is not the case for the research reported in this recent article by Bruno Ferman (Sao Paulo School of Economics) and Luiz Felipe Fontes (Insper Institute of Education and Research), published in the Journal of Public Economics (sorry, I don't see an ungated version online). They used data from a Brazilian private education company that manages schools across the country, and the data covered:

...about 23,000 students from grades 6-11 in 738 classrooms and 80 schools.

Importantly, the data includes student assessment results that were graded by their teacher, standardised test results that were machine-graded, and measures of student behaviour, which the company collected in order to "better predict their dropout and retention rates". Ferman and Fontes collate the behavioural data, and:

...classify a student being assessed in subject s and cycle t as well-behaved (GBits = 1) if she is in the top quartile within class in terms of good behavior notifications received until t by all teachers except the subject one. We classify bad-behaved students (BBits = 1) analogously.

They then compare maths test scores between well-behaved and badly-behaved students, and show that:

...the math test scores of ill-behaved students (BB = 1) are on average 0.31 SD below those such that BB = 0. The unconditional grade gap between students with GB = 1 and GB = 0 is even greater: 0.54 SD in favor of the better-behaved pupils.

So far, so unsurprising. Perhaps better-behaved students also study harder. However, when Ferman and Fontes control for blindly graded math scores, they find that:

...the behavior effects are significantly reduced, indicating that a share of the competence differences seen by teachers is captured by performance in the blindly-scored tests... Nevertheless, the behavior effects remain significant and are high in magnitude, indicating that teachers confound scholastic and behavioral skills when grading proficiency exams. Our results suggest that the better(worse)-behaved students have their scores inflated (deducted) by 0.14 SD...

This is quite a sizeable effect, amounting to "approximately 60% of the black-white achievement gap". And that is simply arising from teacher grading bias. Ferman and Fontes then go on to show that their results are robust to some alternative specifications, and that there is also apparent teacher bias in decisions of which students are allowed to move up to the next grade.
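The identification logic here is worth making concrete. Here is a toy simulation (my own illustrative numbers, not the authors' data or model) of why regressing teacher-assigned grades on behaviour alone overstates bias, while controlling for a blindly graded score gets much closer to the true grading bias:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process (illustrative values only):
ability = rng.normal(size=n)
# Better-behaved students also tend to have higher ability.
behaviour = 0.5 * ability + rng.normal(size=n)
blind_score = ability + rng.normal(scale=0.5, size=n)   # machine-graded test
true_bias = 0.14                                        # grading bias on behaviour
teacher_grade = ability + true_bias * behaviour + rng.normal(scale=0.5, size=n)

def ols(y, X):
    """Least-squares coefficients, with an intercept prepended."""
    X = np.column_stack([np.ones(len(y))] + list(X))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Naive regression: behaviour picks up both the bias and the ability channel.
naive = ols(teacher_grade, [behaviour])[1]

# Controlling for the blind score strips out (most of) the ability channel.
controlled = ols(teacher_grade, [behaviour, blind_score])[1]

print(f"naive coefficient:      {naive:.3f}")
print(f"controlled coefficient: {controlled:.3f}  (true bias = {true_bias})")
```

Note that because the blind score is itself a noisy measure of ability, the controlled coefficient still sits a little above the true bias in this sketch, which is one reason the authors probe robustness with alternative specifications.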

However, should we care about grading bias? Ferman and Fontes point out that their results:

...characterize an evaluation scheme that is condemned by educators and classroom assessment specialists, which explicitly warn against the adjustment of test scores to reflect students’ behavior... and consider this practice as unethical... Their argument is that achievement grades are the main source of feedback teachers send about the students’ proficiency levels. Therefore, test scores help pupils form perceptions about their own aptitudes and assist them in the process of self-regulation of learning; additionally, they help parents to understand how to allocate effort to improve their children’s academic achievement...

Still, one could argue that biasing test scores may be socially desirable if it induces a student to behave better, generating private benefits to the pupil and positive externalities to peers...

Let me suggest another counterpoint. If grades are a signal to universities or to employers about the relative ranking of students in terms of performance, then maybe you want those grades to reflect students' behaviour as well as their attainment of learning outcomes. You might disagree, but I'd argue that there are already elements of this in the way that we grade students (in high schools and universities). If teachers (and educational institutions) were purists about grades reflecting student learning alone, then we would never estimate grades for students who miss a piece of assessment, and we would never scale grades (up or down). The fact that we do those things (and did so especially during the pandemic) suggests that student grades already can't be interpreted solely as reflecting students' attainment of learning outcomes.

Employers (and universities) want grades that will be predictive of how a student will perform in the future. However, academic achievement is an imperfect measure of future performance of students. This is demonstrated clearly in this recent article by Georg Graetz and Arizo Karimi (both Uppsala University), published in the journal Economics of Education Review (open access). They used administrative data from Sweden, focusing mainly on the cohort of students born in 1992. Graetz and Karimi are most interested in explaining a gender gap that exists between high school grades (where female students do better) and the standardised Swedish SAT tests (where male students do better). Specifically:

...female students, on average, outperform male students on both compulsory school and high school GPAs by about a third of a standard deviation. At the same time, the reverse is true for the Swedish SAT, where female test takers underperform relative to male test takers by a third of a standard deviation...

Graetz and Karimi find that differences in cognitive skills, motivation, and effort explain more than half of the difference in GPAs between female and male students, and that female students have higher motivation and exert greater effort. In contrast, there is selection bias in the SAT scores. This arises in part because Swedish students can qualify for university based on grades, or based on SAT scores. So, students who already have high grades are less likely to sit the SAT. Since more of those students are females with high cognitive skills, the pool of SAT takers disproportionately includes high-cognitive-skill males, which is why male test takers perform better on average in the Swedish SAT.
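That selection mechanism is easy to see in a toy simulation (again, my own illustrative assumptions, not the paper's estimates). Below, the sexes have identical SAT aptitude in the full population, but because females get a grade premium and high-grade students skip the SAT, a male advantage appears among actual test takers:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Illustrative assumptions: equal aptitude by sex, but a female grade premium.
female = rng.random(n) < 0.5
skill = rng.normal(size=n)
grades = skill + 0.33 * female + rng.normal(scale=0.5, size=n)
sat_aptitude = skill + rng.normal(scale=0.5, size=n)

# Students who already qualify on grades skip the SAT entirely.
takes_sat = grades < np.quantile(grades, 0.7)

pop_gap = sat_aptitude[female].mean() - sat_aptitude[~female].mean()
taker_gap = (sat_aptitude[female & takes_sat].mean()
             - sat_aptitude[~female & takes_sat].mean())

print(f"female-male gap in SAT aptitude, everyone:  {pop_gap:+.3f}")
print(f"female-male gap among actual SAT takers:    {taker_gap:+.3f}")
```

The population gap is essentially zero by construction, yet the gap among test takers is negative, because the grade cut-off removes proportionally more high-skill females from the SAT-taking pool.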

However, aside from being kind of interesting, that is not the important aspect of the Graetz and Karimi paper that I want to highlight. They then go on to look at the post-high-school outcomes for students born in 1982, and look at how those outcomes relate to grades and SAT scores. In this analysis, they find that:

Grades and SAT scores are strong predictors of college graduation, but grades appear about twice as important as SAT scores, with standardized coefficients around 0.25 compared to just over 0.1...

A one-standard-deviation increase in CSGPA and HSGPA is associated with an increase in annual earnings of SEK15,500 and 25,200, respectively (SEK1,000 is equal to about USD100). But for the SAT score, the increase is only SEK8,000.

In other words, high school grades are a better predictor of both university outcomes (graduation) and employment outcomes (earnings) than standardised tests. This should not be surprising, given that grades, compared with standardised tests, may better capture student effort and motivation, which are predictive of student success in university and in employment. And to the extent that good student behaviour is also associated with higher motivation and greater effort, perhaps we want grades to reflect that too. [*]
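For readers unfamiliar with "standardized coefficients" in the quote above: they come from regressing z-scored outcomes on z-scored predictors, so the coefficients are comparable across predictors measured in different units. A toy sketch (illustrative numbers of my own, not Graetz and Karimi's model, in which grades reflect skill plus effort while the test reflects mostly skill) shows how grades can end up the stronger predictor of earnings:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Illustrative model: earnings reward both skill and effort; grades reflect
# both, while the standardised test mostly reflects skill alone.
skill = rng.normal(size=n)
effort = rng.normal(size=n)
gpa = skill + effort + rng.normal(size=n)
sat = skill + rng.normal(size=n)
earnings = skill + effort + rng.normal(scale=1.5, size=n)

def standardised_ols(y, X):
    """OLS on z-scored outcome and predictors; returns the slope coefficients."""
    z = lambda v: (v - v.mean()) / v.std()
    Xz = np.column_stack([z(x) for x in X])
    design = np.column_stack([np.ones(n), Xz])
    beta, *_ = np.linalg.lstsq(design, z(y), rcond=None)
    return beta[1:]

b_gpa, b_sat = standardised_ols(earnings, [gpa, sat])
print(f"standardised coefficient on GPA: {b_gpa:.2f}")
print(f"standardised coefficient on SAT: {b_sat:.2f}")
```

In this sketch both coefficients are positive, but the GPA coefficient is several times larger, purely because grades load on the effort channel that also drives earnings.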

None of this is to say that we shouldn't be assessing student knowledge. It's more that grades representing a more holistic measure of student success will be more useful in predicting future student performance. That is more helpful for employers, and as a result it may be more helpful for encouraging students to study harder as well.

*****

[*] Of course, selection bias matters here too. In the case of the Swedish SATs, the most motivated and hardest working students may have opted out of the SAT test entirely. However, the analysis that Graetz and Karimi undertook is (I think) limited to students who had both grades and SAT scores recorded.
