Monday, 15 December 2025

Grade inflation at New Zealand universities, and what can be done about it

Grade inflation at New Zealand universities has been in the news recently. This is a delayed reaction to this report from the New Zealand Initiative, released back in August and authored by James Kierstead. He collected data on grade distributions from all eight New Zealand universities (via Official Information Act requests), and looked at how those distributions have changed over time. The results clearly demonstrate grade inflation, most strikingly in Figure 2.1 from the report:

Over the period from the mid-2000s to 2024, the proportion of New Zealand university students receiving a grade in the A range has increased at every New Zealand university, and by more than ten percentage points overall. Kierstead notes that:

Overall, the median proportion of A-grades grew by 13 percentage points, from 22% to 35%... The largest increases occurred at Lincoln, where the proportion of As grew by 24 percentage points between 2010 and 2024 (from 15% to 39%), more than doubling, and Massey, where they grew by 17 percentage points (from 19% to 36%) from 2006 to 2023.

A similar pattern of increases, although not as striking, is seen for pass rates, which in 2024 were above 90 percent at every university except Auckland. The results are also apparent across different disciplines, as shown in Figure 2.4 from the report:

Of course, this sort of grade inflation is common across other countries as well, and Kierstead provides a comparison that shows that New Zealand grade inflation is not dissimilar from grade inflation in the US, UK, Australia, and Canada.

Kierstead then turns his attention to why there has been grade inflation. He first dismisses some possible explanations such as better incoming students (NCEA results have not improved, although even if they had that might be due to grade inflation as well), more female students (the proportion of female students has been flat over the past ten years, while grades have continued to increase), better funding (bwahahahaha - in fact, funding per student has declined in real terms since 2019, while grades have continued to increase), and student-staff ratios (which have declined over time, but the student-academic ratio, which is the one that should matter most, has barely changed).

So, what has caused grade inflation? Kierstead describes it as a collective action problem, akin to the tragedy of the commons first described by Garrett Hardin in 1968:

It is our contention that grade inflation is the product of a dynamic that is not dissimilar to the tragedy of the commons. Just like Hardin’s villagers, academics pursue a good (in this case high student numbers) in a rational way (in this case by awarding more high grades). And just as with Hardin’s villagers, negative consequences ensue, with a common resource (sound grading) being depleted, to the cost of every individual academic as well as others...

In the grade inflation game, the good that academics want to maximize is student numbers. Individual academics, on the whole, want to have as many students in their courses as possible. This suggests that they are popular teachers and can help get them promoted (and hence gain more money and prestige). It can also help make sure the courses they want to teach stay on the menu.

I like this general framing of the problem, where 'sound grading' is a common resource - a good that is rival and non-excludable. However, I would change it slightly, by thinking about the common resource as being A grades generally, which are depleted when the credibility of those grades declines. In my slightly different framing, awarding A grades is rival in the sense that one person awarding more A grades reduces the credibility of A grades awarded by others. Awarding A grades is non-excludable in the sense that if anyone can award A grades, everyone can award A grades (while it is possible to prevent academics from awarding A grades, universities would probably prefer not to do so, because that would reduce student satisfaction).

So, while the social incentive for all academics collectively is to limit the award of A grades to keep the credibility of those grades high, the private incentive for each academic individually is to increase the proportion of A grades awarded, leading to fame and fortune (or, more likely, to fewer awkward conversations with their Head of School about why their grade distribution is too low, as well as better student evaluations - see here and here, for example). Essentially then, the incentives are for academics to inflate grades. And universities have few incentives to act to reduce grade inflation, since higher grades increase student satisfaction and lead to greater enrolments.
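To see the structure of the incentive problem, consider a stripped-down version of the game. The payoff numbers in the sketch below are entirely hypothetical (mine, not Kierstead's), but they illustrate the logic: whatever their colleagues do, each academic is better off inflating, and yet everyone ends up worse off than if nobody had inflated:

```python
# A toy two-academic 'grading commons' game with hypothetical payoffs.
# Each academic chooses to hold the line on grades or to inflate them.
# Inflating is individually rational whatever the colleague does, but
# mutual inflation erodes the shared credibility of A grades.

payoffs = {  # (academic_1, academic_2): (payoff_1, payoff_2)
    ("hold", "hold"):       (3, 3),  # credible grades preserved
    ("hold", "inflate"):    (1, 4),  # the inflator free-rides on credibility
    ("inflate", "hold"):    (4, 1),
    ("inflate", "inflate"): (2, 2),  # credibility depleted for everyone
}

for my_choice in ("hold", "inflate"):
    for their_choice in ("hold", "inflate"):
        mine = payoffs[(my_choice, their_choice)][0]
        print(f"I {my_choice:7} | they {their_choice:7} -> my payoff: {mine}")

# Whatever the colleague does, 'inflate' pays more (4 > 3, and 2 > 1), so
# both academics inflate and end up at (2, 2) instead of (3, 3).
```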

However, there is a problem. As Kierstead notes, grade inflation is well-termed because its effects are similar to the inflation that economists are more familiar with:

If universities hand out more and more As in a way that isn’t justified by student performance, the value of an A will go down. The same job opportunities will ‘cost’ more As as As flood the market. Students who worked hard will see the value of their As decrease over time, just as workers in the economy see their savings decrease in value due to monetary inflation.

So, what to do? Kierstead offers a few solutions in the report, including moderation of grades, reporting grades differently on transcripts, calculating grades differently, making post-hoc adjustments to grade point averages, having national standardised exams by discipline, changing the way that universities are funded to reduce the incentive to inflate grades, changing the culture of academics, and giving out prizes for 'sound grading'. I'm not going to dig into those different solutions, because sometimes the simplest one is the best one. With that in mind, I pick this:

Perhaps the simplest addition that could be made to student transcripts alongside letter grades is the rank that students achieved out of the total number of students on the course. So a student’s transcript might read, for example, ‘Classics 106: Ancient Civilizations: A- (27th of 252).’...

Adding ranking information restores some of the signalling value of grades without needing to reverse grade inflation itself. To see why, consider an example. If an employer has the transcripts of two students, one of whom got an A- grade in econometrics and ranked 17th out of 22 students, while the other student got a B grade and ranked 3rd out of 29 students, it's pretty clear that the grade might not be capturing the full picture of the students' relative merit. Kierstead worries about this simple solution because:

A limitation of rank-ordering is that it might suggest that students who achieved only a lowly ranking had performed badly, whereas they might well have performed very well in an especially difficult course.

Possibly, but the key point is not how well students did in the course, but how well they did relative to the other students in the class, which is exactly what the ranking provides. A further benefit of this approach is that providing a ranking alongside the grade would reduce the incentive for students to cherry-pick easy papers that award high grades, because a high grade on its own would not necessarily come with a good ranking within the class.
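To make the proposal concrete, here is a minimal sketch of how a within-class rank (and a percentile, which I return to below) could be computed and reported, in the spirit of the report's 'A- (27th of 252)' format. The course, grade, and class scores are all hypothetical:

```python
# A minimal sketch of how within-class rank (and a percentile) could be
# generated for transcripts. The course, grade, and scores below are all
# hypothetical.

def ordinal(n: int) -> str:
    """Return '1st', '2nd', '3rd', '4th', ..., handling 11th-13th."""
    if 11 <= n % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def transcript_line(course: str, grade: str, score: float,
                    all_scores: list[float]) -> str:
    """Format a transcript entry with grade, within-class rank, and percentile."""
    class_size = len(all_scores)
    rank = 1 + sum(s > score for s in all_scores)  # ties share the better rank
    percentile = 100 * (class_size - rank + 1) / class_size
    return (f"{course}: {grade} ({ordinal(rank)} of {class_size}, "
            f"{ordinal(round(percentile))} percentile)")

class_scores = [88, 74, 91, 67, 74, 82, 59, 95, 78, 70]  # hypothetical class
print(transcript_line("ECONS101: Economics for Business", "A-", 88, class_scores))
# ECONS101: Economics for Business: A- (3rd of 10, 80th percentile)
```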

Of course, there are potential problems with the simple solution. One such problem is that comparisons across different cohorts of students might not be fair. Taking the example of the two students I gave earlier, perhaps the student who got an A- grade and ranked 17/22 completed the paper in a cohort that was particularly smart, while the student who got a B grade and ranked 3/29 completed the paper in a cohort that was less smart. In that case, the grade without the ranking might be a better measure.

Kierstead's more complex solutions don't really deal well with the problem of between-cohort comparisons, and suffer from being more complicated for non-specialists to understand. A simple ranking, or a percentile ranking, is relatively easy for HR managers to interpret. Having said that, the between-cohort comparisons issue might not be too much of a problem in any case. My experience, though, is that for classes of a sufficiently large size (30 or more students), grade distributions do not differ materially between cohorts (and when they do, it is usually because of the teaching or the assessment, not the students).

I can see some incentive issues though. Would students start to choose papers that they suspect many weak students complete? Good students might anticipate that this would lead to a higher grade and a better ranking, which will look better on their transcript. On the other hand, is that really any worse than what students are doing now, if they choose papers that give out easy grades?

There are also potential issues with stigmatising students who end up near the bottom of a large class (how dispiriting would it be to have your transcript say you got a grade of E, and ranked 317th out of 319 students?). Of course, that could be solved to some extent by only providing ranking information for students with passing grades. And consideration would also be needed for how to deal with very small classes (is a ranking of 4th out of 5 students meaningful?).

Grade inflation is clearly a problem. It's not just nostalgia to say that an A grade is not what it used to be. Grade inflation has real consequences for employers, because the signalling value of high grades is reduced (see here for more on signalling in education). This means that there are also real consequences for high-quality students, who find it more difficult to differentiate themselves from average students. Solving this problem shouldn't involve government intervention to change university funding formulas, or trying to change academic culture. It shouldn't involve complicated statistical manipulations of grades. It really could be as simple as reporting students' within-class ranking on their academic transcripts.

The question now is whether any university would take it on themselves to do so. The credibility of university grades depends on it.

[HT: Josh McNamara, earlier in the year]

Read more:

Sunday, 14 December 2025

Online and blended learning lead to similar outcomes on average, at lower cost but lower student satisfaction

It's been a while since I've written about online or blended learning, which may seem surprising given the ample opportunities for us to learn about online learning during the pandemic. Perhaps I'm still dealing with the trauma of that, or perhaps I have just pivoted more to understanding the emerging role of AI in education. Nevertheless, I recently dipped my toes back into the research on online and blended learning, reading this 2020 article by Igor Chirikov (University of California, Berkeley) and co-authors, published in the journal Science Advances (open access).

Chirikov et al. evaluate a large multisite randomised controlled trial of online and blended learning in engineering, across three universities in Russia. As they explain:

In the 2017–2018 academic year, we selected two required semester-long STEM courses [Engineering Mechanics (EM) and Construction Materials Technology (CMT)] at three participating, resource-constrained higher education institutions in Russia. These courses were available in-person at the student’s home institution and alternatively online through OpenEdu. We randomly assigned students to one of three conditions: (i) taking the course in-person with lectures and discussion groups with the instructor who usually teaches the course at the university, (ii) taking the same course in the blended format with online lectures and in-person discussion groups with the same instructor as in the in-person modality, and (iii) taking the course fully online.

The course content (learning outcomes, course topics, required literature, and assignments) was identical for all students.

Their sample is made up of 325 second-year university students, with 101 randomly assigned to in-person, 100 to blended, and 124 to online. All students then completed the same final examination. Looking at student performance, Chirikov et al. find:

...minimal evidence that final exam scores differ by condition (F = 0.26, P = 0.77)... The average assessment score varied significantly by condition (F = 3.24, P = 0.039): Students under the in-person and blended conditions have similar average assessment scores (t = 0.26, P = 0.80), but those under the online condition scored 7.2 percentage points higher (t = 2.52, P = 0.012). This effect is likely an artifact of the more lenient assessment submission policy for online students, who were permitted three attempts on the weekly assignments.

The lack of a difference in average student performance across different learning modes is a common feature of the literature (see the links at the end of this post). It would have been interesting if Chirikov et al. had undertaken a heterogeneity analysis, to see whether the online and blended modes advantage the more able and engaged students while disadvantaging the less able and engaged students. That general result, where online and blended learning benefit top students but harm weaker ones, is also a feature of the literature, and a point I've discussed many times before (see the links below for more).
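To be clear about what I mean, here is a minimal sketch of such a heterogeneity analysis, assuming a hypothetical dataset with a baseline ability measure (say, prior GPA). The data file and variable names are my own invention, not from the paper:

```python
# A sketch of the kind of heterogeneity analysis I have in mind, using
# statsmodels. The data file and variable names are hypothetical: 'exam'
# is the final exam score, 'condition' is the assigned mode (in_person,
# blended, or online), and 'prior_gpa' proxies student ability/engagement.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("chirikov_style_data.csv")  # hypothetical file

# Interacting assigned mode with prior ability allows the effect of each
# mode to differ for stronger and weaker students, rather than estimating
# a single average effect for everyone.
model = smf.ols(
    "exam ~ C(condition, Treatment(reference='in_person')) * prior_gpa",
    data=df,
).fit(cov_type="HC1")  # heteroskedasticity-robust standard errors
print(model.summary())

# A positive interaction between the online indicator and prior_gpa would
# suggest the online mode advantages more able students relative to
# in-person; the implied effect at low prior_gpa values then indicates
# how the weakest students fare.
```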

Chirikov et al. then look at student satisfaction. Despite their claim that "we find minimal evidence that student satisfaction differs by condition", Table 3 in the paper shows that students in the online mode reported statistically significantly lower satisfaction (by five percentage points) than in-person students, while students in the blended mode also reported lower satisfaction (by about 2-2.5 percentage points), although that difference was not statistically significant.

Finally, Chirikov et al. evaluate the effect on the cost of education, finding that:

Compared to the instructor compensation cost of in-person instruction, blended instruction lowers the per-student cost by 19.2% for EM and 15.4% for CMT; online instruction lowers it by 80.9% for EM and 79.1% for CMT...

These cost savings can fund increases in STEM enrollment with the same state funding. Conservatively assuming that all other costs per student besides instructor compensation at each university remain constant, resource-constrained universities could teach 3.4% more students in EM and 2.5% more students in CMT if they adopted blended instruction. If universities relied on online instruction, then they could teach 18.2% more students in EM and 15.0% more students in CMT.
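It is worth making explicit the arithmetic that links the compensation savings to the enrolment headroom. If instructor compensation makes up a share a of per-student cost, and a delivery format cuts that compensation by a proportion s, then the same budget funds 1/(1 - s × a) times as many students. Here is a back-of-the-envelope check (my own, not the paper's) that reverse-engineers the compensation share implied by the reported figures:

```python
# Back-of-the-envelope check (mine, not the paper's): with the same
# budget, if instructor pay is a share `a` of per-student cost and a
# format cuts that pay by `s`, enrolment can rise by 1/(1 - s*a) - 1.
# Inverting that gives the compensation share implied by the figures.

def implied_comp_share(saving: float, headroom_pct: float) -> float:
    """Instructor-pay share of per-student cost implied by the numbers."""
    h = headroom_pct / 100
    return (h / (1 + h)) / saving

for course, fmt, saving, headroom in [
    ("EM", "blended", 0.192, 3.4),
    ("EM", "online", 0.809, 18.2),
    ("CMT", "blended", 0.154, 2.5),
    ("CMT", "online", 0.791, 15.0),
]:
    share = implied_comp_share(saving, headroom)
    print(f"{course} {fmt:7}: implied instructor-pay share = {share:.0%}")

# Prints shares of roughly 16-19 percent of per-student cost, reasonably
# consistent across the four course/format combinations.
```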

I don't think it will come as a surprise to anyone that online and blended learning are more cost-effective. There is little doubt that this cost advantage has factored into some of the push towards online and blended learning across higher education over time.

Given that, in this study, both online and blended learning lead to similar outcomes on average, one might be tempted to suggest that they are good value for money from the university’s or funder's perspective. For cash-strapped institutions (or governments), the temptation to expand online provision on the back of such numbers is obvious. However, we should be cautious about drawing that conclusion. The lower student satisfaction in the blended and (especially) online modes should be a worry (at least to those who care about student satisfaction). And, as alluded to earlier, the average student performance can hide important heterogeneity between more engaged and less engaged students.

The real question here isn’t whether online and blended learning can be as effective on average, but whether we are comfortable trading lower satisfaction and potential for harms to less engaged students for lower cost of delivery and higher enrolments.

Read more:

Saturday, 13 December 2025

This Kansas City Chiefs conspiracy theory article is a mess

I have to admit to experiencing a non-trivial amount of schadenfreude this year, as the Kansas City Chiefs find themselves with a losing record in December for the first time in a decade. My mild animosity towards the Chiefs is based entirely on their supreme performance over that decade. After they've had a few losing seasons, I won't care anymore (which is how I feel about the Patriots right about now). However, there are plenty of people who have griped about the Chiefs, and claimed that the Chiefs receive favourable referee calls.

I'd label that a conspiracy theory, but it has apparently caught the attention of researchers. This recent article by Spencer Barnes (University of Texas at El Paso), Ted Dischman (an independent researcher), and Brandon Mendez (University of South Carolina), published in the journal Financial Review (sorry, I don't see an ungated version online), explicitly tests whether the Kansas City Chiefs receive favourable referee calls. Specifically, Barnes et al.:

...compare penalty calls benefiting the Mahomes-era Kansas City Chiefs (from 2018 to 2023) and the Brady-era New England Patriots (2015–2019) across the regular and postseason...

Barnes et al. argue that:

...financial pressures, particularly those related to TV revenue (the primary source of revenue for the NFL), serve as the underlying mechanism.

In other words, Barnes et al. claim that the NFL has a strong financial incentive to bias officiating in favour of the 2018-2023 Kansas City Chiefs, to a greater extent than any bias in favour of the 2015-2019 New England Patriots. As we’ll see, the empirical strategy is poorly chosen, parts of the results are misinterpreted, and the proposed TV-revenue mechanism is implausible. All up, you shouldn't believe this paper's results.

What did they do? Barnes et al. use play-by-play data covering the 2015 to 2023 seasons. They restrict their attention to defensive penalties only, which gives them a sample of 13,136 penalties across 2435 games. They apply a fairly simple linear regression model to the data:

Here we find the first problem with their analysis. If you want to show that the Mahomes-era Kansas City Chiefs benefited from more defensive penalties than other teams, you should be running a difference-in-differences analysis. Essentially, you compare the difference between the Chiefs and other teams, between the period before and the period after Patrick Mahomes started playing. In other words, you should test whether the Chiefs’ advantage in penalties grows after Mahomes started playing, compared with their earlier advantage and with other teams over the same period. Barnes et al. simply test for a level difference between the Chiefs and other teams during that time (using the 'Dynasty' variable), but fail to account for whether the Chiefs might already benefit from more defensive penalties before Mahomes became the starting quarterback (in 2018). Indeed, Figure 1 in the paper shows that the Chiefs did benefit from more defensive penalties per game before 2018:

That difference prior to 2018 should be controlled for. Having said that, the difference from the rest of the NFL teams looks bigger from 2018 onwards (but mostly concentrated in 2018-19, and in 2023), so if they had used the more correct difference-in-differences model (or, when comparing regular and post-season, a triple-differences model), they might still have found a statistically significant effect.

There is a further, albeit more minor, issue with the analysis. Barnes et al. control for 'defensive team fixed effects', which they argue controls "for differences in how opposing teams play defense and how frequently they are penalized". However, teams change the way they play defence, particularly when the defensive coordinator changes. So really, they should have used defensive-team-by-season fixed effects there, which would allow the way a team plays (and gets penalised) to vary from season to season, and control for that.
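For concreteness, here is a minimal sketch of what the difference-in-differences specification (with defensive-team-by-season fixed effects) might look like, assuming penalty-level data like theirs. The data file and variable names are hypothetical, and this is my sketch of the design, not the authors' code:

```python
# A sketch of the difference-in-differences specification described
# above, using statsmodels. The data file and variable names are
# hypothetical: each row is one defensive penalty, 'chiefs' equals 1 if
# the offense (the beneficiary) is Kansas City, and 'mahomes_era' equals
# 1 for the 2018-2023 seasons.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("penalty_level_data.csv")  # hypothetical file

# The chiefs:mahomes_era interaction is the DiD estimate: the change in
# the Chiefs' penalty advantage once Mahomes became the starter, over and
# above their pre-2018 advantage and league-wide changes by season.
# C(def_team):C(season) gives defensive-team-by-season fixed effects, so
# each defence's penalty propensity can vary from season to season (the
# mahomes_era main effect is absorbed by the season dimension of these).
model = smf.ols(
    "penalty_yards ~ chiefs + chiefs:mahomes_era + C(def_team):C(season)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["game_id"]})
print(model.params["chiefs:mahomes_era"])
```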

Barnes et al. look at the effect on several outcome variables:

Our primary dependent variables capture different dimensions of officiating decisions. The first is Penalty Yards, which measures the total yards gained or lost due to penalty calls. If the NFL or its officials favor a particular team, we expect them to benefit from potentially more penalty yards assessed against their opponents. The second variable, First Down, is a binary indicator that takes a value of 1 if a penalty call results in an automatic first down. Because first downs have a direct impact on a team’s ability to sustain drives and score points, this measure captures whether penalties disproportionately help a team advance the ball. The third variable, Subjective, is a binary indicator equal to 1 if the defensive penalty falls into a category requiring referee discretion...

The 'Subjective' variable is described in the appendix to the paper, and appears to be far too inclusive since it includes penalties like 'Face Mask' and 'Horse Collar Tackle' that seem to me not to be particularly subjective (and those two categories alone made up 6 percent of all penalties, and a much higher proportion of the 'subjective' penalties).

Putting aside the issues with the analysis for a moment, Barnes et al. find that:

...penalties against Kansas City during the regular season result in 2.02 fewer yards (𝑝 < 0.01), are 8 percentage points less likely to have a penalty call that results in a first down (𝑝 < 0.01), and are 7 percentage points less likely to have subjective penalties (𝑝 < 0.05) compared to the rest of the NFL. This pattern is decisively reversed in postseason contests, where penalties against the Chiefs offense yield 2.36 more yards (𝑝 < 0.05), are 23 percentage points more likely to have a penalty call that results in a first down (𝑝 < 0.01), and are 28 percentage points more likely to have subjective calls (𝑝 < 0.01) compared to the rest of the NFL in the playoffs.

Barnes et al. have explained this incorrectly. Notice their wording suggests the penalties are called on Kansas City (i.e. hurting the Chiefs). Their analysis actually shows that penalties against Kansas City Chiefs' opponents result in 2.02 fewer yards during the regular season, and penalties against Kansas City Chiefs' opponents (not the Chiefs offense) yield 2.36 more yards in the postseason. At least, that is according to the notes to their Table 3, which says:

The dependent variable in Columns (1) and (4) is the realized yardage for the offensive team resulting from a penalty on the defensive team... The independent variable of interest, Kansas City Chiefs, is a binary indicator variable that equals 1 if the offensive team is the Kansas City Chiefs and 0 otherwise.

So, the correct way of interpreting those results is penalties against the opposing defence, not penalties against Kansas City. Barnes et al. then turn to applying the same analysis to the 2015-2019 New England Patriots, and find effects that are mostly statistically insignificant (and small). For other teams that might arguably be called a 'dynasty' (for a sufficiently low bar for what constitutes a dynasty), Barnes et al. find no evidence of differences in defensive penalty calls. That sample includes the Philadelphia Eagles (2017-2023), the Los Angeles Rams (2018-2023), and the San Francisco 49ers (2019-2023).

At this point, the problem with the mechanism starts to become clear. Barnes et al. start to look at TV viewership, and argue that:

If certain teams, particularly those associated with high-profile players, systematically attract larger audiences, then maintaining the success or visibility of those teams may align with the league’s broader financial interests.

If the NFL wanted to attract a larger audience, and aimed to do so by biasing officiating in favour of a particular team, why on earth would they choose a small-market team like Kansas City? Surely they would want to boost a large-market team? According to this ranking, Kansas City is only the 35th-largest sports media market in the US. Now, Patrick Mahomes is a star quarterback (he was the 10th overall pick in the 2017 NFL draft), so maybe it's the combination of star quarterback and media market that matters. However, Tom Brady was also a star quarterback, and Boston is the 10th-largest sports media market. So, why weren't the Patriots getting favourable calls in 2015-2019? If, as Barnes et al. seem to argue, the NFL was going through some particular challenges in 2016, then Kansas City is still not the obvious choice for biased officiating. They should have favoured the LA Rams (in the second-largest sports media market, with star quarterback Jared Goff, the first overall pick in the 2016 NFL draft).

Barnes et al.'s argument falls apart. Their TV viewership analysis does show that:

...the Chiefs’ emergence as a marquee team coincided with a material increase in viewership interest, consistent with the broader financial incentives we hypothesize.

However, that analysis also has issues, because they don't control for the win/loss record of the teams in each game (and winning teams likely attract more TV viewers). And all it really tells you is that Patrick Mahomes attracts a big TV audience. He is a good player, and attracting viewers is what star players do. Higher ratings for teams with star players is not evidence that referees are biased. As noted above, if the NFL thought that way, they should have preferred biasing the officiating towards the LA Rams instead, and Barnes et al.'s analysis shows that didn't happen.

As a final point, there is a real risk that the analysis in this paper gets causality backwards. Did the Chiefs get favourable referee calls because they are a dynasty, or did they become a dynasty because they received favourable referee calls at key moments? Barnes et al. never consider the possibility of reverse causality. Overall, the paper does much more to flatter an existing conspiracy theory than to seriously test it. Even if we take their estimates at face value, nothing in the paper convincingly links referee calls to incentives to increase NFL TV viewership.

[HT: Marginal Revolution]

Friday, 12 December 2025

This week in research #105

Here's what caught my eye in research over the past week:

  • Gillespie et al. (with ungated earlier version here) find evidence of landlord exit from the rental market after rent controls were tightened in Ireland in 2021, with rent controls associated with more sale listings and fewer rental listings/registrations
  • Pagani and Pica (open access) find that exposure to a higher share of same-gender math high achievers is related to better academic performance among Italian primary school children, for both boys and girls, three years later
  • Dutta, Gandhi, and Green (open access) find, using data from India, that relaxing rent control leads to higher rents and decreases rural-urban migration, while easing eviction laws increases the conversion of rental units into owner-occupied housing and increases the prevalence of 'marriage migrants'
  • Couture and Smit find no evidence that Federal Open Market Committee officials in the US select securities that earn abnormal returns
  • Bergvall et al. (open access) find, using Swedish data, that following the start of their PhD studies, psychiatric medication use among PhD students increases substantially, continuing throughout their studies to the point that by the fifth year medication use has increased by 40 percent compared to pre-PhD levels (more reason to worry about the mental health of PhD students)
  • Bagues and Villa (open access) find that, after Spanish regions increased the minimum legal drinking age from 16 to 18 years, alcohol consumption among adolescents aged 14-17 decreased by 7 to 17 percent and exam performance improved by 4 percent of a standard deviation
  • Fan, Tang, and Zhang find, using data on university relocations in China in the 1950s, that there were substantial effects on total employment, firm numbers, and productivity in industries technologically related to the relocated departments
  • Chikish and Humphreys find that surgical repair of UCL injuries extends post-injury MLB pitcher careers by roughly 1.3 seasons relative to matched uninjured pitchers, and that post-injury and treatment pitcher performance improves by roughly 8 percent
  • Chegere et al. (open access) conduct an experiment investigating how regular sports bettors in urban Tanzania value sports bets and form expectations about winning probabilities and find that people assign higher certainty equivalents and winning probabilities to sports bets than to urn-and-balls lotteries with identical odds, even though, in fact, they are not more likely to win
  • Seak et al. (with ungated earlier version here) find that experimental choices by both humans and monkeys violated the independence axiom across a broad range of reward probabilities (both monkeys and humans are not purely rational decision-makers)