Saturday, 31 May 2025

We should be cautious about employing AI tutors as a substitute for human tutors

I'm becoming more convinced over time that there are good approaches to using large language models (LLMs) as tutors, and not-so-good approaches. That statement seems kind of obvious, so let me unpack it a little bit.

There are two main ways that teachers appear to be using LLMs as tutors. The first approach is to give the LLM tutorial or workshop material (or problem sets, or questions), and prompt the LLM tutor to walk the student through the tutorial or workshop (or whatever). The LLM may be prompted to act as a Socratic tutor, encouraging the student to answer questions, while the LLM tutor doesn't directly provide answers. This is the style of tutor employed in the paper I discussed in this post (as one example). This does seem to be the way that most teachers are approaching this, because it is most often the way that we expect human tutors to engage with students. LLM tutors set up in this way are substitutes for human tutors.

The second approach to using an LLM as a tutor is to set it up as a source of answers for student queries. The LLM has a basic prompt, as well as access to the teaching materials, and students can ask questions and the LLM will provide answers, additional context, or examples. This is how we have set up Harriet, our ECONS101 AI tutor. These LLM tutors are not trying to mimic the human tutor, but instead are essentially a reference textbook that students can query and converse with. Indeed, this is probably what the future of textbook publishing will look like, with textbooks replaced by chatbots. LLM tutors set up in this way are complements for human tutors.
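
To make the distinction concrete, here is a minimal sketch of the two setups, using the OpenAI Python library. The prompts, the course material placeholder, and the model choice are all hypothetical illustrations, not the actual prompts behind Harriet or behind the tutors in any of the papers discussed below - the point is simply that the two approaches differ mainly in the system prompt:

    from openai import OpenAI

    client = OpenAI()  # assumes an API key is available in the environment

    COURSE_MATERIAL = "..."  # tutorial questions, notes, or textbook excerpts

    # Approach 1 (substitute): a Socratic tutor that walks students through set questions
    SOCRATIC_PROMPT = (
        "You are a Socratic tutor. Work through the tutorial questions below one at a time. "
        "Ask guiding questions and give hints, but never reveal the answer directly.\n\n"
        + COURSE_MATERIAL
    )

    # Approach 2 (complement): a reference 'textbook' that answers student queries
    REFERENCE_PROMPT = (
        "You are a course reference assistant. Answer student questions using the course "
        "material below, adding context and examples where helpful.\n\n"
        + COURSE_MATERIAL
    )

    def ask_tutor(system_prompt: str, student_message: str) -> str:
        """Send one student message to the tutor and return its reply."""
        response = client.chat.completions.create(
            model="gpt-4o",  # hypothetical model choice
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": student_message},
            ],
        )
        return response.choices[0].message.content

    # For example: ask_tutor(REFERENCE_PROMPT, "Can you explain price elasticity of demand?")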

On the surface, either approach to employing LLM tutors is valid. Teachers with a strong grounding in the education literature are likely to prefer the first way. However, in my view, LLM tutors are not human tutors. We shouldn't be expecting students to engage with LLM tutors in the same way that they engage with human tutors. At this stage, I believe that LLM tutors might be better as complements to human tutoring, rather than substitutes (which is why Harriet is set up the way that she is).

Now, some new evidence suggests that LLM tutors, when set up as substitutes, may fall short in important ways. This article by Armin Alimardani (University of Wollongong) and Emma Jane (University of New South Wales Sydney), forthcoming in the journal Law, Technology and Humans (open access, with a non-technical summary on The Conversation), looks at the use of a ChatGPT tutor in several law classes at the University of Wollongong in Australia. This research is interesting, because the initial implementation of the AI tutor predates the release of ChatGPT (they used a pre-release version of GPT-4). However, Alimardani and Jane also report on results from more recent versions of ChatGPT. In their initial implementation:

...we created SmartTest, an educational chatbot with features designed to align with educational objectives. The chatbot allows educators to pose pre-determined questions to students, ensuring that the topics and focus areas match learning goals. Feedback is generated based on educator-drafted answers to provide a coherent learning path...

The system prompts in the test cycles included three main sections: (1) Instructions—SmartTest’s role and the steps it should take when interacting with students. For example, if a student asks an unrelated question, SmartTest refuses to answer and redirects them to the test cycle question; (2) The Questions; and (3) The Answer Guide.

Essentially, this is GPT as a substitute for a human tutor. Alimardani and Jane are able to see all interactions between the LLM tutor and students, which provides a wealth of data, even though the actual number of students is relatively small (fewer than 50 students in the second semester of 2023). It also appears that SmartTest was available to use during class time as part of a regular tutorial (although this point is not entirely clear from the article). Anyway, Alimardani and Jane made SmartTest available to students across five 'test cycles', each two weeks long. Looking at the conversations, they:

...reviewed SmartTest’s interactions with students to determine the rate of erroneous feedback. Errors were defined as situations in which SmartTest provided incorrect information, failed to adequately correct a student’s inaccurate or partial response, or provided confusing feedback. Errors included cases where a student gave an incorrect or partial response, and SmartTest replied with a positive statement (e.g., ‘That’s correct!’), before presenting the correct answer.

Alimardani and Jane find that:

...during the initial three test cycles, which featured short problem scenarios, the percentage of conversations containing at least one error ranged from 39.5% to 53.5%. By contrast, in Cycles 4 and 5, which involved short-answer questions, the error rate dropped to a range of 6.3% to 26.9%. This variation in performance is partly explained by the difference in the complexity of questions and answer guides in earlier cycles.

It is important to note that they made the questions that SmartTest asked students much shorter and less detailed in the later cycles. However, the error rate is quite worrying, especially when you consider that SmartTest had access to the answers!

Alimardani and Jane then tested more recent iterations of ChatGPT (ChatGPT-4o, GPT-4-Turbo, o1, o1-mini, o3-mini, and GPT-4.5), not with students, but by pre-seeding the newer models with some of the conversations where the GPT-4 version of SmartTest had made errors. They find that:

On average, GPT-4o and GPT-4.5 achieved similar scores and outperformed the other models. Notably, the average performance of the reasoning models (o1, o1-mini and o3-mini) was worse than the other models.

The more recent models did not outperform GPT-4, and in fact the more complex reasoning models did worse! That result was a real surprise to me, especially when you consider that LLM performance is trending upwards over time. Perhaps the more recent LLMs are more willing to deviate from the details in the prompt (which includes the questions and the answers to the tutorial problems)? Alimardani and Jane aren't able to provide a satisfactory answer as to why the latest models perform worse, or a concrete solution for that problem.
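
As an aside on the mechanics, the kind of replay exercise they describe could be sketched along the following lines. This is my reconstruction under assumptions (again using the OpenAI Python library, with hypothetical model names and placeholder conversations), not Alimardani and Jane's actual code:

    from openai import OpenAI

    client = OpenAI()

    # Hypothetical selection of models to re-test; some reasoning models may
    # require slightly different message formats than shown here
    MODELS = ["gpt-4o", "gpt-4.5-preview"]

    # Each pre-seeded conversation: the tutor's system prompt plus the prior turns,
    # ending with the student message that originally triggered an erroneous reply
    error_conversations = [
        [
            {"role": "system", "content": "SmartTest-style instructions, question, and answer guide..."},
            {"role": "user", "content": "A student's incorrect or partial answer..."},
        ],
        # ... more conversations in which the original SmartTest made an error
    ]

    replies = {model: [] for model in MODELS}
    for model in MODELS:
        for conversation in error_conversations:
            response = client.chat.completions.create(model=model, messages=conversation)
            # Each reply would then be graded by hand for whether the feedback
            # correctly identifies and corrects the student's mistake
            replies[model].append(response.choices[0].message.content)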

Despite the errors, LLM tutors might still have value if the students find them helpful. Remember, human tutors are not perfect either! Alimardani and Jane:

...asked students to identify the most appealing features of SmartTest. The feature that received the most votes (n=42) was the instant feedback provided by the chatbot. This was followed by two features, each receiving 28 votes: the conversational format, which allows students to break down their answers and receive feedback on each component separately, and the option to express uncertainty to receive guidance...

Despite identifying instant feedback from SmartTest as its most appealing feature, the students would still prefer feedback from human tutors. Alimardani and Jane:

...asked students to rank their preferred mode of feedback to explore whether the delay in receiving feedback from tutors would influence students’ preference for SmartTest. We found that 51.52% (17 students) selected receiving feedback from their tutor on their learning management system (LMS) with a delay of one or more days as their top choice. In contrast, only 27.27% (9 students) preferred using SmartTest for feedback as their first option.

These results suggest that students themselves don't prefer LLM tutors as a substitute for human tutors. That might help explain the University of Auckland's recent experience with an AI tutor in marketing. Future versions of LLMs may do a better job as a substitute for human tutors (although Alimardani and Jane's results should make us a little wary about that). However, for now, an LLM tutor may be best employed as a complement to human tutoring.

[HT: The Conversation]

Friday, 30 May 2025

This week in research #77

Here's what caught my eye in research over the past week:

  • Battistoni and Martinez (open access) propose a new approach for quantifying ancient inequality and its evolution by relying on inscriptions that indicate property data and artisanal remunerations, then use their approach to estimate inequality in the Greek city-state of Delos in 280 BCE and 190 BCE
  • Radatz and Baten (open access) construct a new index of composite inequality over the period from 1810 to 2010, and find that the risk of a civil war outbreak increases with higher levels of within-country inequality

Also new from the Waikato working papers series:

  • Valera, Holmes, and Delloro study the effects of fuel and rice prices on overall inflation in the Philippines, and find that there is a strong impact during periods of higher inflation, which they then also confirm for Indonesia, Thailand, and India

Tuesday, 27 May 2025

How good is ChatGPT as an economics tutor?

In this post over the weekend, I discussed a paper that looked at the performance of ChatGPT 3.5 in answering economics homework questions (multiple choice, and long answer). Obviously, lots of teachers are interested in the topic of how well large language models perform in answering questions, and lots of researchers (who are also teachers) are investigating the performance of large language models as tutors, and the effects on student learning (including me - more on that in a future post).

Another paper that looks at how well ChatGPT performs as an economics tutor is this working paper by Natalie Bröse (University of Applied Sciences Bonn-Rhein-Sieg), Christian Spielmann (University of Bristol), and Christian Tode (University of Applied Sciences Bonn-Rhein-Sieg). In the paper, Bröse et al.:

...analyze the performance of three ChatGPT models (GPT-3.5, GPT-4o, and o1preview) in two key tasks: explaining economic concepts and answering multiple-choice questions with explanations. We use CORE Econ’s The Economy 1.0 (Bowles, Carlin, and Stevens 2017) as our reference material and evaluate ChatGPT’s responses across multiple dimensions relevant to student learning.

It is interesting (to me) that Bröse et al. use the CORE Econ text, because that is the text we use as a base for my ECONS101 paper (although we do not follow it particularly closely). They also undertake quite a detailed investigation of the quality and characteristics of ChatGPT's responses to questions about economic concepts, as well as the characteristics of multiple-choice questions that ChatGPT gets right or wrong. This adds a lot of depth to the analysis, and is really helpful for those of us who are training or testing our own economics AI tutors, because it highlights some areas where even the most recent iterations of ChatGPT (like o1-preview) still seem to struggle.

When looking at the explanation of economic concepts, Bröse et al. find that:

the inclusion of inaccurate or misleading information in at least an important part of the concept... occur in 28.6% of GPT-3.5 outputs, 17.8% of GPT-4o outputs, and 12.5% of o1preview outputs.

It will depend on your priors as to how good or bad you think those results are. I think the improvement across versions of ChatGPT is instructive in terms of the trend over time, and what we might expect from more recent versions. Looking at the types of errors that are made, Bröse et al. find that:

...over 75% of the errors across all models are factual errors. This means that information provided is incorrect based on established facts, either because models fail to retrieve accurate information from their training data, or facts are incorrectly assimilated during the learning process.

That wouldn't be so bad, except that:

...inaccurate responses are often open-domain (i.e., users cannot identify incorrect information without extensive research outside of the chat) in 76.67% of cases for GPT-3.5, 85.19% for GPT-4o, and 64.7% for o1preview. This changes slightly when focusing solely on responses with an accuracy score of 3 or lower... inaccurate responses are of the open-domain in 76.47% of cases for GPT-3.5, 91.91% for GPT-4o, and 50% for o1preview.

So, when ChatGPT makes errors in explaining concepts, it often does so in a way that makes it hard for users to realise that an error has been made. To a large extent, that isn't surprising. If the error was obvious without extensive research outside the chat, the chances are that ChatGPT would not have made the error in the first place. However, it also highlights that it still pays for users to double-check ChatGPT's responses to questions like this, when they are not sure. In fact, I can report that several of my ECONS101 students this trimester have queried Harriet's responses to a question, and she has immediately realised that she made an error! There is good learning for both the student and the AI tutor from such an experience (or, at least for the student, because Harriet won't remember the interaction later - I just update her knowledge base to try and minimise future errors of the same type).

Bröse et al. then go on to report that ChatGPT generally gives a clear and accessible explanation, but that the responses often "lack important detail or nuance" (which they term 'scope'). That doesn't surprise me for a one-shot response to a question, and I think that Bröse et al. are being a little unfair in their assessment of ChatGPT on the scope dimension. If the user wants additional detail or nuance, then they really should ask additional follow-up questions, rather than expecting a detailed and nuanced treatise on a topic in a one-shot query.

Bröse et al. also note that, in relation to the quality of examples provided by ChatGPT:

66.07% of responses from GPT-3.5, 73.21% from GPT-4o, and 51.79% from o1preview received a score of 3 or lower, indicating that the examples provided were weak, unhelpful, or even detrimental to comprehension. An additional 25% of GPT-3.5 responses, 17.86% of GPT-4o responses, and 42.86% of o1preview responses attained a score of 4, suggesting that the examples were relevant but overly simplistic.

That has been my experience of ChatGPT as well, and looking at transcripts from students' conversations with Harriet (for students who have shared them with me), it does seem to be a general trend. However, again, using the simplistic example as a starting point and then engaging in a more detailed conversation as a follow-up does generally lead to better results.

Turning to multiple choice questions, Bröse et al. find that:

GPT-3.5 correctly assessed 67% of the options, GPT-4o achieved 91%, and o1preview reached 93%.

This provides further evidence, if any were needed, that online multiple-choice tests are not going to be an effective assessment tool ever again. Unless the goal is to hand out some free marks to students, of course.

Bröse et al. finish their paper by offering some advice for teachers:

  1. Integrate ChatGPT into economics courses by leveraging its strengths in explanations and question-answering while educating students on its capabilities and limitations...
  2. Teach students to use ChatGPT as a learning aid, not a replacement for textbooks, lectures, or their own effort in solving problems...
  3. Acknowledge that students will struggle to detect incorrect output...
  4. Shift classroom focus to application-based learning by leveraging a flipped classroom model and providing clear, detailed examples...
  5. Guide students to specify the question context as well as the chatbot’s role when prompting...
  6. Refine your problem statements for easier comprehension...
  7. Verify translation accuracy before using ChatGPT in Non-English courses. 

The last piece of advice is interesting because they also evaluated ChatGPT in German and generally found that ChatGPT started by translating the query into English, answering it, then back-translating. Some of their other advice can be taken out of students' hands by the teacher creating their own AI tutor (as I have). Then, prompting and problem statements become less of an issue.

Overall, this paper is a good reminder of how much improvement there has been in large language models in just the last couple of years (ChatGPT-3.5 was released in November 2022). Even o1-preview is nearly a year old now, and has been supplanted. However, we need to remain mindful that large language models do convey an aura of certainty and accuracy in their responses that is not always warranted. At the very least, when in doubt, ask ChatGPT to check its answers!

Sunday, 25 May 2025

Universities' (and teachers') cheap talk on generative AI and assessment

Generative AI should be changing the way that universities assess students. I say "should be", rather than "is", because it seems to me that a lot of teaching staff really have their head in the sand on this, continuing to assess in a very similar way, and simply attaching a warning label ("thou shalt not use generative AI") to each assessment, as if that will make a difference. The futility of that approach is the topic of this new article by Thomas Corbin, Phillip Dawson (both Deakin University), and Danny Liu (University of Sydney), published in the journal Assessment and Evaluation in Higher Education (open access).

Corbin et al. focus attention on university-level frameworks, but many of the things that they say apply equally to each paper. When considering how to approach the impact of generative AI on assessment, and how assessment needs to change as a result, Corbin et al. distinguish two approaches: (1) discursive changes, which involve telling students what is and what is not permitted; and (2) structural changes, which involve changing the assessment itself so that the way that students may use AI (or not) is specifically factored into the assessment.

Corbin et al. make the important point that:

...existing frameworks predominantly rely on merely discursive methods which introduces significant vulnerabilities related to compliance and enforceability, ultimately undermining assessment validity and institutional reputation. Although these systems may have value in other areas, for example by assisting teachers to conceptualise the different ways AI may be used in a task, from a validity standpoint any change which is merely discursive and not structural is likely to cause more harm than good.

Discursive approaches include 'traffic light' systems, or various assessment scales, where teachers communicate to students what generative AI use is or is not allowed. They also include requirements for students to disclose the use of generative AI in their assessments. The problem with discursive changes to assessment is obvious:

Without reliable detection mechanisms, prohibitions against AI use remain merely discursive. This technological limitation exposes a more fundamental issue with discursive approaches. That is, they rely entirely on student compliance with rules that cannot be enforced.

There is no reliable way of detecting generative AI use in student assessment. The best that teachers can do is to rely on vibes. Or when a student writes in their essay that they are 'delving' into a 'rich tapestry' or a 'multifaceted realm' and trying to find the 'intricate balance' or a 'symbiotic relationship'.

Corbin et al. instead advocate for structural changes, which they define as:

Modifications that directly alter the nature, format, or mechanics of how a task must be completed, such that the success of these changes is not reliant on the student’s understanding, interpretation, or compliance with instructions. Instead, these changes reshape the underlying framework of the task, constraining or opening the student’s approach in ways that are built into the assessment itself.

They illustrate with some examples, starting with:

A traditional take-home essay (asynchronous) provides students with ample opportunity to use AI without detection, regardless of what instructions are provided. In contrast, a supervised in-class writing exercise (synchronous) inherently limits AI assistance by its very structure.

Justin Wolfers would approve. However, Corbin et al. rightly note that:

This doesn’t mean that all assessment should become synchronous and supervised; certainly, asynchronous assessment has valuable benefits for developing certain skills. The key is aligning the assessment structure with what we genuinely want to measure. If we want to develop a student’s ability to think deeply and develop complex arguments over time, an asynchronous format may be appropriate, but we would need to build in structural assessment elements that capture the development process rather than just the final product.

Corbin et al. don't leave us hanging. Even though they can't solve all of our AI-related assessment issues, they do offer some suggestions:

First, structural changes frequently involve reorienting assessment from output to process. Rather than evaluating only the final product, which could potentially be AI-generated, assessment may be designed to capture the student’s development and attainment of understanding and skill over time. This might mean building in authenticated checkpoints where students must demonstrate their evolving thinking. For instance, rather than simply submitting a final essay, students might need to participate in live discussions about their developing ideas or demonstrate how their thinking evolved through structured peer feedback sessions...

Second, structural changes often involve viewing assessment validity at the unit or module level rather than the task level. Instead of trying to ensure each individual assignment is AI-proof (an increasingly futile endeavour), educators can design interconnected assessments where later tasks explicitly build on a student’s earlier work.

This relates back to two earlier posts of mine. This post talks about assessment specifically, while this post talks about changing the way that students interact with generative AI in learning and assessment tasks, so that their skills are scaffolded through their degree. We do need to make changes to assessment practices. It is possible for assessments to change in ways that take account of students' access to generative AI. It is not necessary to forbid generative AI use in all situations. It is probably equally unhelpful to declare 'open season' on generative AI use. As with all things, there is a balance to be struck, and university teachers need to find that balance.

Corbin et al. conclude that discursive changes to assessment:

...remain powerless to prevent AI use when they rely solely on student compliance. They say much but change little. They direct behaviour they cannot monitor. They prohibit actions they cannot detect. In other words, when it comes to appropriate assessment change for a time of AI, talk is cheap.

Simply using a set of written rules on when and how students can use generative AI is, at best, ineffective, and at worst, may actively harm student learning. Those rules are cheap talk. We can do much better.

[HT: Maria Neal]

Saturday, 24 May 2025

ChatGPT and economics homework questions

Back in December last year, I briefly discussed why we eliminated Moodle quizzes from my ECONS101 assessment (we still have quizzes, but they are not for credit, and they happen every day - more on that in a future post). The problem with Moodle quizzes is that there are browser extensions that will automatically answer Moodle quizzes using generative AI. That makes Moodle quizzes largely a waste of time as an assessment tool (although I believe that they still have value as a learning tool).

Of course, Moodle quizzes are not the only assessments that have been rendered obsolete by generative AI. As Justin Wolfers notes, any high-stakes at-home assessment is now essentially worthless. But it's not just high-stakes assessment. Problem sets or homework assignments are also affected. This new article by Rachel Faerber-Ovaska (Youngstown State University) and co-authors, published in the journal Bulletin of Economic Research (open access), asks the question, "Has ChatGPT made economics homework questions obsolete?"

Faerber-Ovaska et al. test the ability of ChatGPT to answer 1112 multiple choice and 186 long answer questions (which they call essay questions) from the question banks of the 2nd edition of Principles of Economics by Greenlaw and Shapiro. They find that:

The bot answered 67.63% of the 1112 multiple-choice questions correctly.

Faerber-Ovaska et al. then looked at the characteristics of the questions that ChatGPT got wrong, and report that:

The inclusion of tables or figures, as well as higher levels of difficulty, were found to significantly decrease the odds of ChatGPT answering correctly. For example, the model estimated odds of ChatGPT answering a question with a table correctly were only 0.45, corresponding to an 80% lower probability compared to a question with no table. Overall, the bot struggled with material from chapters requiring visual interpretation, such as supply and demand, elasticity, theory of the firm, and financial economics.

As for the long answer (essay) questions:

...we found that the bot scored higher for clarity than for content. The bot score for content was an A on 72.0% of questions, whereas for clarity, the bot scored an A on 93.5% of the questions. Overall, for essay questions, the bot’s responses earned an A in 72.0% of the questions and a B in 18.3% of the questions.

It is worth noting that Faerber-Ovaska et al. were testing ChatGPT 3.5, and more recent versions of ChatGPT (and other large language models) are likely to perform even better. Nevertheless, in answering the question they pose, Faerber-Ovaska et al. conclude:

Have the economics homework and test questions we currently rely on been rendered obsolete by ChatGPT? The answer is: yes, as they are used now.

The "as they are used now" is important in that sentence. Economics teachers (and teachers in other disciplines) need to change the way that we do things. Homework may still be effective as a learning tool, but it will not be effective if the approach is simply to have students turn in (or complete online) homework problems that are copy-pasted from a large language model. For the moment, these models are not great at drawing accurate diagrams in economics, but that just means that economics has mere moments more time to adapt than some other disciplines. Homework may still have a place in student learning, but it needs to be structured in a way that ensures that students engage, even if they are using generative AI. In my ECONS101 and ECONS102 classes, our low-tech solution is to have students complete homework in handwritten form only. The homework is not graded, but is built on in-person in tutorials (and completion of the tutorials is worth a small amount of marks). This ensures that even if students are using generative AI, they need to engage in class as well. This is reinforced by an approach to assessment that is heavily weighted towards in-person invigilated assessment, where the use of generative AI is unlikely (for now!).

Homework isn't dead (in economics, or in general). But its role in student learning and assessment needs to change.

Friday, 23 May 2025

This week in research #76

Here's what caught my eye in research over the past week:

  • Huseynov (with ungated earlier version here) finds that students reduce their confidence regarding their future earning prospects after exposure to AI debates, and this effect is more pronounced after reading discussion excerpts with a pessimistic tone (but will their actual experience match their expectations, pessimistic or otherwise?)
  • Pipke (open access) studies 7,000 soccer penalty shootouts and 74,000 kicks and finds no evidence of a first-mover or second-mover advantage in winning probability
  • Böheim, Freudenthaler, and Lackner (open access) find that women’s NCAA basketball teams with a male head coach are 6 percentage points more likely to take risk than women’s teams with a female head coach
  • Clemens and Strain (with ungated earlier version here) study the interplay between minimum wages and union membership, and find that each dollar in minimum wage increase predicts a 5 percent increase (0.3 pp) in the likelihood of union membership among individuals ages 16–40, which may explain why unions are in favour of minimum wage increases
  • Yang and Zhou (open access) find that, after controlling for quality as well as author-, paper-, and journal-specific attributes, publications in economics with a Chinese first author receive 14% fewer citations
  • Abdulla and Mourelatos (open access) find that Russian migrants are significantly less likely than local Kazakhs, local Russians, or Kyrgyz migrants to receive job interview invitations in Kazakhstan, based on data collected after the start of the Russia-Ukraine War

In other news, I had two articles published in The Conversation this week, on the New Zealand Budget:

  • This article, co-authored with Michael Ryan, discusses the difficulty of economic forecasting and why this 2025 Budget was particularly tight
  • This article gave my quick take on the Budget (since it was published only a couple of hours after the Budget was announced), as well as summarising the key Budget announcements in each area

Thursday, 22 May 2025

Costco buttering up New Zealand consumers

Overseas, Costco uses a variety of products as loss leaders, including rotisserie chicken (see here). Right now in New Zealand, it appears to be butter. As the New Zealand Herald reported yesterday:

It was organised chaos this week at Costco when another delivery of butter arrived.

This butter is not just any butter – while the other supermarkets are selling a 500g slab for up to $10, Costco’s butter is $9.99 for a kg...

Chris Schulz, a senior investigative journalist at Consumer NZ, said it looked likely that the Costco butter was a loss leader.

“The retailer’s Facebook page is flooded with people speculating when the butter might be back on shelves, debating when to visit, and showing off when they do get it.

“With butter costing at least $17 per kilo elsewhere, Costco’s pricing makes them look like the ‘good guys’ in contrast to our supermarket duopoly. Once they’re in store, I’m sure many people are picking up roast chickens, cheese, and giant tubs of biscuits too.”

As I note in my ECONS101 class, the ideal loss leading product is one that has a high price elasticity of demand, and lots of complementary goods. Elastic demand means that a decrease in price will increase the number of consumers by a lot. So, loss leading will get a lot more consumers in store. And complementary goods are goods where lowering the price of one good causes the consumer to buy more of the other good. Most supermarket staples that are regularly purchased will be complementary goods, because consumers tend to buy them together on the same shopping trip. So, lowering the price of one causes the consumer to buy more of the other goods on their shopping list.
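
For readers who like to see the formulas, these two conditions are just the standard own-price and cross-price elasticities. The numbers below are entirely made up, purely to illustrate the signs and magnitudes that make for a good loss leader:

    \varepsilon_{butter} = \frac{\%\Delta Q_{butter}}{\%\Delta P_{butter}} = \frac{+120\%}{-40\%} = -3 \quad \text{(elastic, since } |\varepsilon| > 1\text{)}
    \varepsilon_{cheese,\,butter} = \frac{\%\Delta Q_{cheese}}{\%\Delta P_{butter}} = \frac{+20\%}{-40\%} = -0.5 \quad \text{(negative, so cheese and butter are complements)}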

In this case, by selling butter at a loss (and it must be at a loss, because there's no way that selling butter at half the price of other retailers is profitable), Costco is able to attract many more consumers, who then buy other things that Costco can profit from. The Herald article offers some examples, including this one:

Kaleb Halverson decided to start making the trip from New Plymouth to Auckland to deliver Costco’s 1kg blocks of butter at $9.99 to customers across the Taranaki region.

He only had a few orders at first, but they kept rolling in...

He brings back everything the store has to offer, but said butter is definitely top of the list. “It’s our hot item; at the moment, every order has butter.”

Other popular products are cleaning products and snacks.

Costco sells the butter at a loss, and makes up for it with greater sales of (and profits from) cleaning products and snacks. 

Wednesday, 21 May 2025

Try this: Songs about economics

Time for a spot of leisure? Looking for something to listen to while unwinding? Try this Spotify playlist by John Hawkins (University of Canberra), titled "Songs about economics".

It has some obvious classics like "(I Can't Get No) Satisfaction" by The Rolling Stones, "Money, Money, Money" by ABBA, and "Money" by Pink Floyd. It even has some more contemporary songs like "Price Tag" by Jessie J., "Bills" by LunchMoney Lewis, and "7 Rings" by Ariana Grande. And a few songs that are quite obscure (at least, to me, since I had never heard them before), like "Offshore Banking Business" by The Members, and "Capitalism" by South Korean rapper Jvcki Wai.

Given the breadth of songs on the playlist though, there are some notable omissions. Why not include "Mo Money, Mo Problems" by The Notorious B.I.G.? Or "Money Trees" by Kendrick Lamar? Or even more obviously, "Minimum Wage" by They Might Be Giants? Or, given that Hawkins is Australian, "Blue Sky Mine" by Midnight Oil?

As a bonus, you should listen to this song, "There is No Depression in New Zealand" by Blam Blam Blam:


Enjoy!

[HT: John Hawkins, in this article in The Conversation about Spotify]

Tuesday, 20 May 2025

Black Mirror Season 7 illustrates the ultimate version of customer lock-in

[This post contains spoilers. You have been warned.]

I love the TV show Black Mirror. Charlie Brooker (the writer of almost all episodes of the show) is an evil genius. Nearly every episode depicts some dystopian near-future that is just plausible enough to make you both worry, and think. The first episode of the latest (seventh) season, titled Common People, is a perfect illustration of this. It is also a perfect illustration of customer lock-in, albeit at an extreme level. From the Wikipedia description of the episode:

Welder Mike Waters (Chris O'Dowd) and schoolteacher Amanda (Rashida Jones) have been married for three years and are trying to conceive a baby. One day while teaching, Amanda collapses, and doctors discover she has an inoperable brain tumor. Mike is introduced to Gaynor (Tracee Ellis Ross), a representative from tech startup Rivermind Technologies. Gaynor explains that Rivermind can remove the tumor and replace her excised brain tissue with synthetic tissue powered by their servers. While the surgery is free, the couple agree to pay a monthly subscription fee to give Amanda a chance at living a normal life again.

Initially the service seems to help Amanda, but as time passes they find that it has several limitations which can only be bypassed by subscribing to the costlier "Plus" tier, as opposed to their current "Common" tier. Unbeknownst to Amanda, she begins interjecting brief advertisements into her daily speech.

As I describe in my ECONS101 class, customer lock-in occurs when consumers find it difficult to change once they have started purchasing a particular good or service. High switching costs (the cost of switching from one good or service to another, or from one provider to another) are likely to generate customer lock-in, because a high cost of switching can prevent customers from changing to substitute products. High switching costs can also, in some cases, prevent consumers from giving up the good or service altogether - that is, the switching cost causes consumers to keep buying the good even though they would stop if there were no switching cost. This is the case for subscriptions, for example (see here or here or here).

In this case, Rivermind appears to have discovered the ultimate form of customer lock-in. The switching cost that Mike and Amanda face if they try to cancel their Rivermind subscription is that Amanda dies (or becomes comatose - the episode is somewhat unclear on this point). That switching cost is obviously very high and provides a strong incentive for Mike and Amanda to keep their subscription going. They are locked into the subscription, which is quite expensive.

Rivermind doesn't just profit from Mike and Amanda through their subscription. Rivermind also engages in a form of multi-period pricing. Typically, firms engage in multi-period pricing by starting new consumers with a low price, and then raising the price once those consumers are locked in. This is what utility firms are trying to do when they offer a discounted rate for electricity or broadband for new customers (for a limited time!). The price is initially low, and then when the new customers are locked in, the price increases (because the discount ends).
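
A stylised two-period example makes the logic clear (the valuation and switching cost below are made-up numbers, not anything from the episode or a real firm). Suppose a consumer values the service at v per period, and faces a switching (or cancellation) cost of s once they have signed up. Then:

    \text{Sign up in period 1 if } p_1 \le v, \qquad \text{keep paying in period 2 if } p_2 \le v + s
    \text{e.g. } v = \$40,\ s = \$60 \;\Rightarrow\; p_1 \le \$40 \text{ to attract the customer, but } p_2 \text{ can rise as high as } \$100

In Rivermind's case, the switching cost s is effectively unbounded, which is what allows the ever more expensive tiers that follow.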

Rivermind's approach is somewhat different to the standard case of multi-period pricing. Instead of directly raising the price of the service that Mike and Amanda receive, Rivermind degrades the quality of that service (by introducing advertising). Rivermind then introduces an advertising-free tier that is more expensive (which Mike and Amanda are invited to 'upgrade' to, even though it is really just a higher price for the service they started with). Rivermind then also introduces more tiers of subscription with greater coverage and more perks (and even higher prices).

The Black Mirror episode focuses on the increasingly desperate ways in which Mike tries to keep the subscription going. However, my takeaway is that it illustrates how firms can lock consumers in with switching costs that are non-monetary, and then profit from those locked in consumers. Thanks Charlie Brooker - now you've given me something else to worry about in the dystopian near-future.

Monday, 19 May 2025

The impact of Nobel Prizes and MacArthur Fellowships on the winners

What happens to the research productivity of winners of top research awards? On the one hand, a research award like a top fellowship or a Nobel Prize might increase a researcher's impact, as other researchers follow the path they have laid down. On the other hand, maybe there is some 'mean reversion', where a previously high-flying researcher simply returns to a less stellar research trajectory (which would look like a decrease in productivity). Or, perhaps a top research award grants a researcher the freedom to explore new, higher-risk areas of research, which could lead to much higher, or much lower, productivity overall?

The question of what happens to researchers after winning a top research award is addressed in this 2023 article by Andrew Nepomuceno, Hilary Bayer, and John Ioannidis (all Stanford University), published in the journal Royal Society Open Science (open access). They looked at the pre- and post-award citation counts for all 72 winners of the Nobel Prize in chemistry, medicine, or physics over the period from 2004 to 2013, and 119 of the 238 MacArthur Fellows (only including those in STEM or social science fields) over the same years. Specifically, they compared publications from two three-year periods: (1) the two years before the award and the year of the award; and (2) the three years after that. They counted citations for the pre-award period up to 2015, and the post-award period up to 2019 (so that both the pre-award and post-award periods had the same number of observed years of citations).
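
For anyone wanting a feel for the mechanics of that comparison, the core of it is a paired signed-rank test on each winner's pre-award and post-award citation counts. A minimal sketch in Python, with made-up citation counts rather than Nepomuceno et al.'s data:

    from statistics import median
    from scipy.stats import wilcoxon

    # One pair per award winner: total citations to pre-award and post-award publications
    pre_award = [850, 1200, 430, 990, 2100, 760]    # hypothetical counts
    post_award = [610, 1150, 350, 800, 1500, 790]   # hypothetical counts

    statistic, p_value = wilcoxon(pre_award, post_award)
    median_change = median(post - pre for pre, post in zip(pre_award, post_award))

    print(f"Wilcoxon signed-rank p-value: {p_value:.3f}")
    print(f"Median change in citations (post minus pre): {median_change}")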

In their main results, Nepomuceno et al. report that:

Nobel Laureates and MacArthur Fellows received fewer citations for post-award work than for pre-award work... The difference was driven predominantly by Nobel Laureates while there was little difference, on average, for pre- versus post-award citation impact for MacArthur Fellows. The median decrease was 80.5 citations among Nobel Laureates and 2 among MacArthur Fellows. For Nobel Laureates, the decrease reached statistical significance (Wilcoxon signed-rank test p = 0.004), whereas for MacArthur Fellows the decrease was not statistically significant (Wilcoxon signed-rank test p = 0.857)...

Post-award citation impact was lower than the pre-award citation impact for 45 of 72 (62.5%) Nobel Laureates and for 63 of the 119 (52.9%) MacArthur Fellows.

Both Nobel Laureates and MacArthur Fellows suffered a reduction in citations per publication after receiving their award, but for different reasons. The Nobel Laureates published the same number of papers in the period after the award as they did before the award, but received fewer total citations, so their citations per publication were lower. In contrast, the MacArthur Fellows published more papers after the award than they did before the award, with no change in total citations (which again means fewer citations per publication).

One major difference between the two groups is age - Nobel Laureates are much older than MacArthur Fellows. So, Nepomuceno et al. conducted further analyses stratified by age (in three groups: under 42 years old, 42-57 years old, and over 57 years old), and found that:

...the declining citations pattern was seen only for researchers who were 42 or older at the time of the award, while an opposite pattern was seen for early career researchers who were given an award (especially MacArthur award) at an age of 41 or younger.

However, looking at Table 2 in the paper, it is clear that the negative impact on total citations is largest for the youngest Nobel Prize winners (those aged under 42 years), but is negative for all three age groups. In contrast, there is a positive impact on citations for the youngest MacArthur Fellows, and a negative impact for MacArthur Fellows aged over 42 years.

Overall, Nepomuceno et al. conclude that:

Although the MacArthur Fellowship and Nobel Prize selection committees share a stated goal of assisting winners in realizing their potential more fully, in terms of citation counts neither the MacArthur Fellowship nor the Nobel Prize heralded increased research impact for the subsequent work and for Nobel Laureates there was even a significant decline.

It is tempting, then, to conclude that these awards are not a good idea. I'm not so sure. I think the research highlights different impacts of the two awards, and I think we learn something potentially important from this. Nobel Laureates may tend to rest on their laurels (pun intended), or may suffer from mean reversion. Or, perhaps they use the profile accorded by their new status as Nobel Laureates to try and have greater policy or political influence, with an opportunity cost of lower research influence. That suggests that it is better to award Nobel Prizes to end-career academics, lest younger academics be diverted from important and path-breaking research. The recent trend in awarding Nobel Prizes to younger recipients (definitely noticeable in economics) may therefore have a negative unintended consequence. In contrast, because there is a positive citation impact for young recipients, the MacArthur Fellowships should be targeted in greater proportion to younger researchers. There is less to be gained from awarding those Fellowships to end-career academics.

To be fair, that is more-or-less how those two awards have historically been allocated: Nobel Prizes to end-career academics, and MacArthur Fellowships ('genius grants') to young stars. This research suggests that might be an important practice to continue.

[HT: Marginal Revolution, back in 2023]

Saturday, 17 May 2025

The characteristics of the 'young stars' in economics

The top young 'star' economists represent the future leaders of the discipline. Understanding where they are coming from, what they are studying, and where they are going to, is therefore important. This 2019 article by Kevin Bryan (University of Toronto), published in the journal Economic Inquiry (ungated version here), provides a look at the 'young stars' in economics over the period from 2013 to 2018 (who are most likely to be newly tenured professors now, in 2025). [*]

Bryan focuses his sample on top economists, defined as those who received multiple 'flyouts' for academic job interviews at top universities between 2013 and 2018. As he explains:

While applications and interviews are largely nonpublic, flyouts are often publicly posted on department seminar lists, and accepted offers are of course publicly viewable on the hired student’s vita... This suggests two possible definitions of a “star”: those who accept top offers, and those who are flown out to top places. The problem with the former is that one of the questions we would like to answer is where top students take jobs, and using the job accepted as a definition begs the question.

For this reason, our definition of a star is any economist within 8 years of beginning their PhD, who has never had a permanent job after graduating, and who has received a sufficiently large number of high quality flyouts... We begin with a list of the top 25 U.S. economics PhD programs in the U.S. News 2013 rankings, then add eight top business schools which frequently hire economists in nonfinance positions, Harvard Kennedy’s policy program, and 10 European and Canadian programs which regularly fly out top junior candidates. For each of these 44 programs, we gathered flyout lists from departmental seminar websites each year between 2013 and 2018, and augmented these with e-mail requests to departments which do not post flyouts publicly... We then assign consistent weights to a flyout at each program, with more prestigious flyouts receiving more weight, and consider a star any student who receives sufficiently many weighted flyouts...
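
That weighted-flyout definition is straightforward to operationalise. As a toy sketch (the weights and the threshold below are invented for illustration; they are not Bryan's actual values):

    # Hypothetical prestige weights per program; unlisted programs get a default weight
    PROGRAM_WEIGHTS = {"MIT": 3.0, "Harvard": 3.0, "LSE": 2.0, "Toronto": 1.5}

    def is_star(flyouts: list[str], threshold: float = 6.0) -> bool:
        """A candidate counts as a 'star' if their weighted flyout score clears the threshold."""
        score = sum(PROGRAM_WEIGHTS.get(program, 1.0) for program in flyouts)
        return score >= threshold

    # For example: is_star(["MIT", "Harvard", "LSE"]) returns True (score 8.0 >= 6.0)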

This results in a sample of 226 'young stars' in economics, and Bryan summarises where those students are from, what they have been doing, and where they ultimately went to, using data collected from the students' CVs, job market papers, and LinkedIn. First, in terms of background, Bryan reports that:

The 226 star students come from 40 countries, of which 35% are American, 35.4% are European, and the remainder are from the rest of the world.

Notably, there was one student from New Zealand in Bryan's sample. I wish I knew who it was, but honestly, I have no idea! Then, in terms of where they graduated, Bryan notes that:

While the national origins of star students are diverse... the PhD program diversity of students is less so. Totally 47% of star students come from only five PhD programs, and 84.5% came from only 11 universities, including students from all programs at those schools. Only 9.3% of stars—21 total—did their PhD outside the United States.

Those top five PhD programmes were MIT (31 students out of 226), Harvard (25), Princeton (18), Yale (16), and Stanford (15). The highest non-US institution is London School of Economics, with eight students. Turning to gender, Bryan reports that:

...only 20.4% of star students are female, a percentage never exceeding 25% in any of the 6 years in our sample.

This is not great news, although Bryan notes first that this reflects the pipeline at top universities:

In the 2018 cohort, among the 11 programs that historically produce the most star students, there are 187 men and 50 women listed on those programs’ job market websites. That is, only 21.1% are female.

Bryan then notes in a footnote that:

Although 2019 data is preliminary at publication time, and hence not included in the overall analysis, there is a stark difference in that cohort: 20 of 43 stars, or 46.5%, are female.

So, perhaps there is some evidence of a balancing of genders in top programmes, and among 'young stars', although we would need more than just one year of data to support that conclusion. Moving on to the question of what 'young stars' study, Bryan finds that:

...the most striking fact is that job market stars almost universally studied economics or a technical field as their undergraduate degree... Over 75% of all job market stars have an undergraduate degree in economics, and nearly 95% have an undergraduate degree in either economics or a technical subject (mathematics, statistics, operations research, physics, or engineering).

That might be a striking fact, but not at all surprising. Interestingly though, there are different pathways to the PhD for American and non-American stars:

...34% complete their PhD within 6 years of their first tertiary degree. Americans are slightly more likely to do so, and men as well, though the differences are statistically insignificant... the reasons why Americans and non-Americans do not go straight from their undergraduate to PhD work are very different—Americans work, often as RAs, and non-Americans study at the master’s level—but the net effect is that both groups delay going “straight through” from undergraduate study at a similar rate.

In terms of field of study for their PhD job market paper, Bryan notes that:

...when we concatenate subfields into the broad categories of “applied micro,” “macro,” and “micro and econometric theory,” applied micro is the primary field of 45.6% of stars. There is no time trend...

There are large differences in field between male and female stars. Over 67% of female stars have applied micro as their primary field; only 40% of men have the same (Fisher exact test: p < .005). This difference is largely driven by the overrepresentation of women among stars in development and labor. On the other hand, in the broad definition of macroeconomics, in which we include growth, monetary, pure macroeconomics, finance, international and political economy, there were only seven female stars over 6 years, representing barely 10% of macro stars in that period.

This gender difference in areas of specialisation within economics is well known (for example, see here). However, I thought this bit was surprising: 

Publications prior to the job market are not a necessary condition for stars... 51% have a publication or an R&R [Revise and Resubmit]... That said, the flip side of this statistic is that half of job market stars have no publication or R&R at all, and 80% do not have a top five publication or R&R.

That really does mean that 'young stars' are being hired on the strength of their networks, and whatever they can convey through an interview process, rather than the signal provided by high-quality publications. And: 

Looking at heterogeneity in publishing, female stars are 32% less likely to have a publication or R&R than men (Fisher exact test: p < .05).

That result is interesting, and I'm unsure how to interpret it alongside the other gender differences. Could this simply reflect that female economists take longer to get published (as this paper suggests)? Or that, among potential young stars with no publications, universities are more likely to fly out a promising female applicant? Either of those could be the case, and it would be interesting to see if this result holds up in more recent cohorts, and to dig into what might explain it.

Finally, Bryan looks at where the 'young stars' go, reporting that:

A total of 64.2% of all stars take a job at a U.S. economics department, and 47% of stars go to the top 15 departments alone. Another 21.7% accept jobs at a U.S. business school, almost always at a top 10 school.

That is not surprising, although this may be:

The only star student in our sample who went to the private sector went as a postdoc, and has since returned to academia.

I suspect that more recent cohorts have larger numbers attracted to private sector tech jobs, although it is also possible that 'young stars' still have a strong preference for academia, and it is the next tier down of PhD graduates who end up in the private sector. And Bryan cites some research to support that interpretation. Finally, Bryan turns his attention to postdocs, noting that:

Fourteen students, or just over 5%, became a star on the market following a postdoc... That is, not only is it not necessary to do a postdoc before being competitive for top permanent jobs in economics, it is in fact rare to do so.

I found that quite interesting, and a little surprising that more students didn't use a postdoc as a 'finishing school' or a way of getting a head-start on publications before the tenure clock starts. Again, perhaps that is more of a feature for the next tier of PhD graduates, and the 'young stars' are less affected?

Anyway, these 'young star' economists will almost all now be tenured professors, and no doubt form the core of the next generation of 'senior star' economists. It would be interesting to see some follow-up research on more recent cohorts, especially to investigate whether there have been any changes in the gender balance and gender differences within these top emerging economists.

*****

[*] I'd love to argue that the six-year gap (which nicely aligns with the tenure clock) between publication and my reading of this article was purposeful. But really, as regular readers of this blog might have noticed, I'm running through a bunch of papers I set aside in 2019 to read, and never got to (thanks in large part to pandemic-related teaching workload).

Friday, 16 May 2025

This week in research #75

Here's what caught my eye in research over the past week:

  • Gavresi, Litina, and Tsiachtsiras (open access) look at how motorway and railroad length impacts interpersonal and political trust, and find that infrastructure enhances trust by promoting mobility and exposure to new people and ideas, as well as by elevating political trust as the government is perceived as more reliable and effective
  • Arai and Okazawa (open access) find that being the first contestant is favourable in a Japanese television comedy show
  • Adamson and Fitzsimmons (open access) construct and analyse a database of warfare around the Mediterranean from 600 to 30 BCE, and find that there was no democratic peace among Ancient Greek city-states and mixed results, both inside and outside of Greece, about how war relates to state power
  • Zhou et al. review the last 17 years of research on the impact of artificial intelligence on the labour market

Wednesday, 14 May 2025

An interesting paper about the first 50 years of Nobel Prize winners in economics

The first Nobel Prize in economics was awarded in 1969, to Ragnar Frisch and Jan Tinbergen. The fiftieth prize was awarded in 2018, to William Nordhaus and Paul Romer. In total up to that point, there had been 81 Nobel laureates in economics. This 2019 article by Allen Sanderson (University of Chicago) and John Siegfried (Vanderbilt University), published in the journal The American Economist (ungated version here), reviews those first fifty awards. In addition to summarising the topics, Sanderson and Siegfried collate a lot of interesting factoids, starting with the origins of the award:

The 1895 will of Swedish scientist Alfred Nobel specified that his estate be used to create annual awards in five categories—physics, chemistry, physiology or medicine, literature, and peace—to recognize individuals whose contributions have conferred “the greatest benefit on mankind.” Nobel Prizes in these five fields were first awarded in 1901...

In 1968, Sweden’s central bank, to celebrate its 300th anniversary and also to champion its independence from the Swedish government and tout the scientific nature of its work, made a donation to the Nobel Foundation to establish a sixth Prize, the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel...

Sanderson and Siegfried summarise the backgrounds of the laureates, which are mostly unsurprising (a lot of top universities, and a lot of economics and mathematics), although:

Some notable surprises include Middle Tennessee State Teachers College... (James Buchanan) and South Dakota State University (T. W. Schultz).

And apparently, Eugene Fama's undergraduate degree was in romance languages! Sanderson and Siegfried also note that:

Economics joins literature and peace as the Nobel fields that have generated the most controversy. First, as a well-known quip has it, “economics is the only field in which two people can share a Nobel Prize for saying opposing things.” The 1972 Prizes awarded to Myrdal and Hayek spring to mind, as would the 2013 awards to Fama and Shiller...

I have often been tempted to create an assessment for my ECONS102 class to name and justify the best economist (living or dead, but eligible when living) never to have won a Nobel Prize. Sanderson and Siegfried provide their own list of economists who died before the economics Nobel Prize existed (but after Nobel Prizes were first awarded in 1901), which includes Leon Walras (who died in 1910), Vilfredo Pareto (1923), Alfred Marshall (1924), Thorstein Veblen (1929), John Bates Clark (1938), John Commons (1945), John Maynard Keynes (1946), Irving Fisher (1947), Joseph Schumpeter (1950), John von Neumann (1957), Arthur Pigou (1959), and Karl Polanyi (1964). That seems like a reasonable list to me.

Sanderson and Siegfried also provide a further list of economists who died after 1969 but never received a Nobel Prize, but could have done, which includes Frank Knight (died in 1972), Alvin Hansen (1975), Oskar Morgenstern (1977), Joan Robinson (1983), Piero Sraffa (1983), Fischer Black (1995), Amos Tversky (1996), Zvi Griliches (1999), Sherwin Rosen (2001), John Muth (2005), J.K. Galbraith (2006), Anna Schwartz (2012), and Martin Shubik (2018). Sanderson and Siegfried then add:

To this list, one could certainly add more of their contemporaries, for example (in alphabetical order), Anthony Atkinson (2017), William Baumol (2017), Harold Demsetz (2019), Evsey Domar (1997), Rudiger Dornbusch (2002), Henry Roy Forbes Harrod (1978), Harold Hotelling (1973), Nicholas Kaldor (1986), Jacob Mincer (2006), Hyman Minsky (1996), and Ludwig von Mises (1973), among many others.

I would agree with many of those from both lists, especially Robinson, Baumol, Demsetz, and Hotelling. It is worth noting that Fischer Black would almost certainly have shared the 1997 Nobel Prize with Myron Scholes (and Robert Merton), while Amos Tversky would almost certainly have shared the 2002 Nobel Prize with Daniel Kahneman (and Vernon Smith). There have also been surprising near misses in each direction, one of which was William Vickrey, who died three days after the award was announced (and therefore some months before the award ceremony). The other notable near miss was where:

Polish macroeconomist Michal Kalecki was nominated for the Nobel Prize in 1970 but died in April of that year...

Sanderson and Siegfried wisely steered clear of suggesting potential future winners (after 2018 when their sample ends). Nevertheless, the article is a great summary of the first 50 years of the Nobel Prize in economics, and well worth a read. 

Sunday, 11 May 2025

Why study economics? US graduate earnings edition...

One of the factors that should affect student decision-making is post-graduation income. Having invested a lot of time and effort (and money) in studying, that investment needs to pay off in some way. Higher lifetime income may not be the only benefit of studying, but it is an important one. And so, majors that offer higher incomes might be more attractive.

The good news for economics graduates is that economics consistently rates as one of the majors that offers the biggest return on investment (whether measured in terms of lifetime earnings, or in terms of earnings X years after graduation). There is plenty of evidence to support this (browse through some of the links at the end of this post for some examples). The latest evidence, in the context of recent graduates from the US, comes from this blog post by Marisol Cuellar Mejia and Hans Johnson. They looked at recent graduates, aged 22 to 27 years, using data from the American Community Survey. Their findings are neatly summarised in this figure (for the interactive graphic version, go to the blog post):

Notice that economics graduates are the third-highest earners within that age group among the 'top ten majors', behind computer science and nursing, ahead of business and management, and far ahead of sociology or psychology. The median earnings of an economics graduate in that age group were US$73,000 per year. Extending out to other majors beyond the 'top ten', economics is also behind engineering (electrical engineering, computer engineering, and mechanical engineering), and on par with finance (although the highest earners with finance majors earn more than the highest earners with economics majors).

Of course, there are some limitations to the data, and some fields (medicine, law) require graduate degrees in the US, so they aren't captured for comparison. However, it is clear that economics can be a rewarding career in monetary terms. And as a bonus, economics offers really interesting jobs as well (see some of the links at the end of this post).

[HT: Susan Olivia]

Read more:

Saturday, 10 May 2025

Book review: Gut Feelings

The perfectly rational decision-maker, making their decisions based on all of the available information, is an impossible ideal. No such decision-maker would ever be able to make a decision, as there would always be more information to collect, more aspects of the decision to weigh up, and new alternatives becoming available. The reason that people can make decisions at all is that we are not perfectly rational decision-makers. A whole field of behavioural economics has grown up around showing the various ways that heuristics (rules of thumb) and biases affect real-world decision-makers. These heuristics and biases are ways that real-world decision-makers unconsciously deal with excess information in decision-making.

In a 2007 book I just finished reading, entitled Gut Feelings, Gerd Gigerenzer focuses on a class of heuristics that he labels intuitions, or gut feelings. In the book, Gigerenzer invites us:

...on a journey into a largely unknown land of rationality, populated by people just like us, who are partially ignorant, whose time is limited and whose future is uncertain.

Gigerenzer positions the book as an antidote of sorts to a general perception in economics (and other social sciences) that decision-makers make decisions in a conscious and thoughtful way. Gigerenzer prefers that we consider decision-making to be a complex and adaptive process, affected by context and the environment, and where unconscious decision-making, which can only be rationalised after the fact, comes to the fore. He notes that:

Generations of students in the social sciences have been exposed to entertaining lectures that point out how dumb everyone else is, constantly wandering off the path of logic and getting lost in the fog of intuition. Yet logical norms are blind to content and culture, ignoring evolved capacities and environmental structure. Often what looks like a reasoning error from a purely logical perspective turns out to be a highly intelligent social judgment in the real world. Good intuitions must go beyond the information given, and therefore, beyond logic.

Gigerenzer presents a large number of examples of decision-making where intuitive answers prove to be better than, or at least as good as, the outcomes of more considered logical decisions. He collects these together as different heuristics, like the 'gaze heuristic' (fix your gaze on a moving target, such as a ball in flight, and adjust your speed so that the angle of gaze stays constant), or 'fast and frugal trees' (which collapse complex decision-making trees to a sequential decision based on a hierarchy of aspects of the alternatives). It is a compelling story. However, it falls into the same trap that a lot of behavioural economics does. Although Gigerenzer is able to provide a large number of specific (and to some extent idiosyncratic) examples of cases where intuition proves to be effective, he is unable to provide guidance on when our intuition should be followed, and when it might steer us astray. Gigerenzer seems to steer the reader away from even considering that intuition might be wrong, writing in the concluding section that:

The quality of intuition lies in the intelligence of the unconscious: the ability to know without thinking which rule to rely on in which situation.

I like to think that I have good intuition, in some situations. Certainly not in all situations, and that is what I worry about: when is my intuition causing errors in decision-making? In saying this, I am drawing on the next book I am reading, Noise by Kahneman, Sibony, and Sunstein (which I will post a review of soon). My early takeaway from the latter book is just how 'noisy' and error-prone decision-making is, and it seems to me that understanding the circumstances under which intuition is prone to error would be helpful in deciding whether to stick with the intuitive answer, or consider our decisions a little more fully.

The latter sections of Gigerenzer's book show how the ideas apply to moral behaviour and social instincts. These sections were interesting, but I felt like they were somewhat distant from the overall narrative. However, the book overall is interesting, even if it was not as closely aligned with behavioural economics as I expected it to be. And even though it is now somewhat dated, it still provides a prompt for some thoughtful reflection on how we make decisions.

Friday, 9 May 2025

This week in research #74

Here's what caught my eye in research over the past week:

  • Nguyen (with ungated earlier version here) finds that working hours per week are reduced by 0.38% for a 1% increase in the minimum wage in Vietnam
  • Nasser and Deutscher find a 15 percentage point decrease in attendance at women's football matches in Germany when there are overlapping match days between women’s and men’s matches, with a slightly larger effect for matches of the same team for women’s and men’s matches (for spectators, women's football and men's football are substitutes)

Also new from the Waikato working papers series:

  • Luengo et al. investigate the effect of continuous, free-form communication among traders on mispricing in experimental asset markets, and find that, contrary to expectations, communication has limited effectiveness, with only a slight reduction in mispricing observed in the most complex asset scenario

Thursday, 8 May 2025

The disturbing lack of impact of comments and replications of economics research

Back in 2018, I wrote a comment on an article that I was the reviewer for. The article (open access) described four types of 'economic citizen'. My comment (open access) argued that one of the four types was not distinct from the other three, but instead captured a different dimension of economic citizenship that might apply to any of the other three types. The comment was published alongside the article, along with a reply (open access) by the original article's authors, in the journal Education Sciences. What has happened since is that, according to Google Scholar, the original article has been cited 23 times. The comment has been cited just once, by the authors' reply, which has itself never been cited.

Now it may be that my critique of the original article is not valid, and so subsequent authors citing the original article don't feel the need to cite the comment. Or, it could be that the critique is valid (I still think it is), but is being ignored for some reason. If this is a common experience for research, then it calls into question whether research is self-correcting or not. If research findings are not easily overturned or challenged, then future researchers may be wasting a lot of time and effort on research that will not advance the field, being based on questionable earlier research.

How big a problem is this? That is the question addressed in this new article by Jörg Ankel‐Peters, Nathan Fiala, and Florian Neubauer (all Leibniz Institute for Economic Research), published in the journal Economic Inquiry (open access). Ankel‐Peters et al. look at all 56 replications published as comments between 2010 and 2020 in the American Economic Review, arguably the overall top journal in economics (and certainly in the top five journals). As they explain:

For the self‐correction claim to hold, we hypothesize that a comment should lead to a strong reaction of the literature, especially for a comment raising substantive concerns about an OP. If it does not respond strongly, we argue, the prior in the literature sustains. We look at two facets of a strong response: (1) Citations of the comment relative to citations to the OP after comment publication (henceforth: citation ratio), and (2) Whether the comment affects the OP's annual citations.

Ankel‐Peters et al. don't conduct formal statistical tests, and instead rely on a descriptive analysis. Nevertheless, the results are compelling:

We find that AER comments do not affect the OP's [Original Paper's] citations and hence their influence on the literature. We observe an average citation ratio of 14%. Comments are cited on average seven times per year since their publication—compared to an average of 74 citations per year for the OP since publication of the comment. Comments are, hence, not cited much in absolute terms, and a lot less than the OP. The latter implies that most OP citations ignore the comment.
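
To see how the citation-ratio arithmetic works, here is a minimal sketch in Python (the citation counts are invented for illustration, not the authors' data). It also shows why an average of per-paper ratios, which appears to be what the 14% figure reports, will generally differ from simply dividing 7 by 74:

```python
# Minimal sketch of the citation-ratio metric (invented numbers, not the authors' data).
# For each comment/original-paper (OP) pair, the ratio is the comment's annual citations
# divided by the OP's annual citations after the comment was published.
comment_citations_per_year = [2, 10, 5, 1]    # hypothetical comments
op_citations_per_year = [50, 40, 120, 30]     # hypothetical matched OPs

ratios = [c / op for c, op in zip(comment_citations_per_year, op_citations_per_year)]
average_of_ratios = sum(ratios) / len(ratios)

ratio_of_averages = (
    sum(comment_citations_per_year) / len(comment_citations_per_year)
) / (sum(op_citations_per_year) / len(op_citations_per_year))

print(f"average of per-pair ratios: {average_of_ratios:.1%}")  # analogous to the reported 14%
print(f"ratio of average citations: {ratio_of_averages:.1%}")  # analogous to 7/74, about 9.5%
```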

They also find very similar results when they focus on citations in the top five economics journals, and similar results when they account (subjectively) for whether the comment 'must be cited' alongside the original paper because of how substantive the concerns raised are. Ankel-Peters et al. conclude that:

We interpret this as evidence for the absence of self-correction mechanisms in economics.

Obviously, that is disappointing. Replications are seen as important to ensuring the integrity and credibility of research (for example, see here). If papers that have failed to replicate, or that have substantive problems with them, are being cited uncritically in the following literature, then economics research could easily be led down some dead-end paths. What can be done to mitigate this problem? Ankel-Peters et al. don't offer much in the way of solutions. However, the bare minimum would be that comments and replications should receive more prominence so that readers of the original research are aware of them. To this end, Ankel-Peters et al. report a very small victory:

In response to a previous version of our paper, the current AER editor has let us know that the journal has changed its policy and now, for new comments, will provide a link on the OP's website. This is a small but perhaps important first step to giving replication work in economics the attention it needs and deserves.

Indeed. Now all journals need to follow AER's lead (which isn't much of a lead, since many journals already do that). A better solution would be if top journals required citation of relevant high-quality comments or replications whenever an original paper is cited. And it would be even better if top journals published more high-quality replications and comments, as well as more high-quality systematic reviews and meta-analyses. Then we might advance economics research in a more informed way.

Tuesday, 6 May 2025

The effect of the Dobbs US Supreme Court decision on medical school applications

In 2022, the US Supreme Court ruling in Dobbs v. Jackson Women's Health Organization effectively overturned the famous 1973 Roe v Wade decision, returning the decision about abortion policy to individual states. Many states had laws that would ban abortion if Roe v Wade was ever overturned, and those laws immediately went into effect. Other states responded by enacting legislation to protect the availability of abortion. Overall, the landscape for medical care in the US changed overnight, with different effects in different parts of the country.

How did that change affect the incentives to become a doctor? That is effectively the question addressed in this 2024 article by Joshua Hess (University of South Carolina), published in the journal Economics Letters (ungated earlier version here). Hess looks at the effect of the decision on medical school applications, comparing the years before 2022 with 2022 and 2023 (after the Dobbs decision), for states with and without abortion restrictions. In other words, Hess applies a difference-in-differences analysis, evaluating how the difference between states with and without abortion restrictions on their books changed between the period before the Dobbs decision and the period after it, when those restrictions became effective.
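
To make the difference-in-differences setup concrete, here is a minimal sketch in Python. This is not Hess's code or data; the variable names (treated, post, women_share) and all of the numbers are invented purely for illustration:

```python
# Minimal difference-in-differences sketch (invented data, not Hess's analysis).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
rows = []
for school in range(100):
    treated = int(school < 50)      # hypothetical: school is in a state with a post-Dobbs abortion ban
    for year in range(2018, 2024):
        post = int(year >= 2022)    # after the Dobbs decision
        women_share = (0.52
                       + 0.01 * treated
                       + 0.005 * post
                       + 0.0065 * treated * post   # the 'true' effect built into this fake data
                       + rng.normal(0, 0.01))
        rows.append({"school": school, "year": year, "treated": treated,
                     "post": post, "women_share": women_share})

df = pd.DataFrame(rows)

# The coefficient on treated:post is the difference-in-differences estimate: the extra
# change in women's share of applications at treated schools after the Dobbs decision.
model = smf.ols("women_share ~ treated * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["school"]})
print(model.summary().tables[1])
```

As in any difference-in-differences design, the key identifying assumption is that women's application shares in restricted and unrestricted states would have trended in parallel in the absence of the Dobbs decision.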

Looking at the share of women applying to medical schools, Hess finds that:

Women’s share of applications to treated schools increased by 0.65% (p-value = 4.8×10⁻⁶) relative to control schools in 2022 and by 1.17% (p-value = 3.2×10⁻⁶) in 2023.

So, Hess finds that women make up a larger share of medical school applications in treated states. I find that a little surprising. But then things get a little weirder, because:

...there is little change in total enrollments over the time frame. It increased from 21,507 in 2018 to 22,845 in 2023. Consequently, changes in women’s (men’s) share strongly imply a similar change in the number of women (men) medical students.

In other words, taking those two results together, states that have abortion bans must have gained more medical school applications from women and fewer applications from men, after the Dobbs decision (and relative to states without abortion bans).

It is difficult for me to understand the decision-making of medical school applicants here. However, I can offer some idle speculation. In states with abortion bans, the status of health care as a profession declined after the Dobbs decision. Moreover, the status of women's health care declined even more, making women's health less attractive as a specialty, compared with other specialties. There is some weak supporting evidence for this, as Hess finds that:

...the share of women applicants to top OB/GYN schools in states with abortion bans increased by nearly 1% more than not top OB/GYN schools in states with abortion bans...

OB/GYN is a medical specialty that attracts more women than men, and the larger effect there than overall is consistent with a reduction in the status of women's health care. However, if a decline in the status of health care as a profession were explaining these results, then the total number of medical school applications should have fallen, whereas Hess notes that it hasn't changed. Instead, more women are being attracted to medical school, and fewer men.

Are the outside options (the next best alternative to medical school) better for men than for women? I don't know the answer to that question, but even if that were true it would only explain the decrease in male medical school applicants, and not the corresponding increase in female applicants.

That leaves me with one last possible explanation: that women responded to the threat to women's health in states with abortion bans by deciding to become doctors. That would be an effective way of fighting to protect women's health, a form of resistance. So, even as men become less likely to apply to medical school due to the reduction in status of medicine (especially in women's health care), women respond by applying to medical school in greater numbers.

Hess isn't able to disentangle the reasons underlying the results he finds. So, my takeaway from this paper is that it raises more questions than it answers. Clearly, this is not the final word on this topic, and hopefully there is other research going on that will help us to better understand why there has been such a gender shift in medical school applications.

Monday, 5 May 2025

The mental health of economics PhD students and staff in Europe

Back in 2021, I wrote a post about the mental health of PhD students in economics. It was based on two studies and this Substack post by Scott Cunningham. The conclusion was that economics PhD students were suffering, but perhaps no more so than PhD students in other disciplines. However, patting ourselves on the back for being no worse than any other discipline seems like a failure to me, especially when many students are genuinely in mental health crises.

The study in Cunningham's post that was focused exclusively on economics PhD students was US-based, so it is worth wondering if the results apply elsewhere. This new article by Elisa Macchi (Brown University) and co-authors, forthcoming in the American Journal of Health Economics (ungated earlier version here), provides an answer to that, being based on data from 14 top European economics departments. The study uses a similar methodology to the US study that Cunningham discussed, and two of the authors (Valentin Bolotnyy and Paul Barreira) are the same. So, these studies are about as comparable as they can get. However, this new study also looks beyond PhD students, considering the mental health of staff in economics departments as well.

Specifically, Macchi et al. got survey responses from 556 students and 255 staff, from 14 universities across Europe:

...Bocconi University, Bonn Graduate School of Economics, Central European University, European University Institute, London School of Economics, Mannheim Graduate School of Economics, Paris School of Economics, Sciences Po, Stockholm School of Economics and Social Sciences, University College London, Universitat Pompeu Fabra, University of Warwick, University of Zurich, and Uppsala Universitet.

It's worth noting that most of the top-ranked European economics departments are included in that list. Notable exceptions are Toulouse School of Economics, Oxford University, and Barcelona School of Economics (all ranked in the top ten in Europe, according to RePEc). Macchi et al. explain that they "restricted our interest to Economics departments that offer a cohort-style PhD program, where graduate students are admitted in cohorts to a graduate school, rather than following a chair-style model". That might explain the exclusion of other top universities from the sample.

The surveys were quite detailed, and in terms of mental health they included commonly used measures of depression, anxiety, suicidality, loneliness, and 'imposter phenomenon'. The last of these deserves a bit more explanation, and Macchi et al. note that imposter phenomenon:

...is a condition in which one feels like a fraud and worries about being found out. Individuals experiencing imposter phenomenon do not believe that their success is due to their competence, but rather ascribe success to external factors such as luck. Those experiencing imposter phenomenon often experience fear, stress, self-doubt, and discomfort with their achievements. Imposter fears interfere with a person's ability to accept and enjoy their abilities and achievements, and have a negative impact on emotional well-being...

Many PhD students (and indeed, many academic staff) can probably relate to that. Given the range of measures employed, the two samples (students and staff), and the comparisons with the US sample (where enabled by the use of the same questions), the paper has a huge amount of detail, and so it's difficult to excerpt from. The relevance of the comparisons with the US is somewhat limited because Macchi et al. conducted their survey starting in November 2021, when many people were still feeling the mental health impacts of the COVID-19 pandemic. However, Macchi et al. do attempt to establish how much of the difference in results (for depression and anxiety) relates to the pandemic.

The headline results are that there are:

...high rates of depression and anxiety symptoms, as well as suicidal or self-harm ideation, loneliness, and imposter phenomenon among graduate students in European Economics departments. 34.7% of graduate students experience moderate to severe symptoms of depression or anxiety and 17.3% report suicidal or self-harm ideation in a two-week period. 59% of students experience frequent or intense imposter phenomenon.

And in comparison with the US sample:

The prevalence of severe and moderate depression and anxiety symptoms in our sample of European Economics graduate students is notably higher than in the 2017-2018 sample of graduate students from top Economics departments in the U.S. (Bolotnyy, Basilico, and Barreira 2022) and higher than in a meta-analysis of depression, anxiety, and suicidal ideation among PhD students prior to the COVID-19 pandemic (Satinsky et al. 2021).

The Satinsky et al. paper is the other research that Cunningham referred to in his Substack post that I mentioned earlier. So, European PhD students have worse mental health than US PhD students. However, how much of that is due to the pandemic? Macchi et al. use data on the trends in mental health among Harvard University students, and note that:

...we can attribute approximately 74% of the difference in the prevalence of moderate-severe depression and 30% of the difference in the prevalence of moderate-severe anxiety between our European sample and the 2017-2018 U.S. sample to the impact of the COVID-19 pandemic.

So, the differences in mental health were not entirely driven by the pandemic. European PhD students do indeed appear to suffer more from depression and anxiety than US PhD students. What about staff though? Macchi et al. find that:

In our faculty sample, the prevalence of severe and moderate anxiety is on average lower than graduate students as well as than comparable statistics for the post COVID-19 European population. This average, however, hides a substantial heterogeneity by seniority level. Untenured tenure-track faculty in Europe are as likely to experience depression and anxiety symptoms as graduate students in our sample, and non-tenure track faculty show even higher prevalence of depression or anxiety symptoms. In contrast, the prevalence of depression and anxiety symptoms among European tenured faculty in our sample is about 70% lower than among their graduate students and is well below the comparable rates in the post-pandemic European population.

That makes a lot of sense. Un-tenured junior academics face many of the same workload and other pressures that PhD students do. Senior and tenured academics do not. So, it shouldn't be a surprise that there is a demonstrable difference in mental health measures between junior and senior academic staff.

Macchi et al. then turn to other results from their survey, showing that:

...25.9% of students in the European sample report having experienced at least one form of sexual harassment. Excluding a form of harassment not included in the U.S. study, the sexual harassment prevalence rate in our European graduate student sample (19.5%) is comparable to the U.S. sample (19.4%).

Again, that is not good. And worryingly:

...European Economics PhD students with moderate-severe symptoms of depression or anxiety are less likely to be in treatment (19.2%) than Economics PhD students in U.S. top departments (25.2%).

That difference in access to treatment may explain some of the differences in mental health between European and US PhD students. That also leads to the first of several recommendations that Macchi et al. make (which I think should be read alongside the recommendations that Bolotnyy et al. made for the US study, which I outlined in this post). Macchi et al. recommend that: (1) the usage of mental health services by students and staff be normalised and enabled; (2) sexual harassment be addressed; (3) relationships between students and their advisors be improved; and (4) more structure be offered in PhD programmes to avoid students getting into ruts. I think we can and should support all of those recommendations; they would certainly help PhD students, not just in Europe and not just in economics, but more generally.

[HT: Marginal Revolution, back in 2023]

Read more: