Monday, 2 June 2025

Some better news for AI tutors as a substitute for human tutors

Saturday's post made the case that we aren't there yet for AI tutors as a substitute for human tutors. However, there have been some success stories (see this post, for example). And in another example, this new working paper by Martín De Simone (World Bank) and co-authors shows that AI tutors can be very effective (and cost-effective) at teaching English in Nigeria. Specifically, the study:

...analyzes the effects of an after-school program in which students interacted with a large language model twice per week to improve their English skills, following the national curriculum. The intervention was implemented in Benin City, Nigeria, using Copilot, an LLM powered by the GPT-4 model at the time of implementation... The program was implemented over a six-week period between June and July 2024, targeting first-year senior secondary school students, who are typically 15 years old...

In the first session, teachers familiarized students with Microsoft Copilot, emphasizing both its educational benefits and potential risks, such as over-reliance on the model and the possibility of hallucinations and biased outputs. The goal was to foster responsible usage, encouraging students to complement their learning with the AI tool while retaining critical thinking skills.

Each subsequent session focused on a topic from the first-year English language curriculum, aligned with the material that students covered during their regular classes. The sessions began with a teacher-provided prompt, followed by free interaction between the student pairs and the AI tool...

The lesson guides and their prompts were carefully crafted to position the LLM as a tutor, focusing on facilitating learning rather than simply providing direct answers. 

The programme was run in nine schools over a six-week period. The 'teacher-provided prompt' ensures that students remain on-task, and makes this intervention similar to the 'AI tutor as a substitute for a human tutor' approach that I discussed in Saturday's post. Unlike the research I discussed in that post, De Simone et al. are not interested in the fidelity of the AI model in sticking to the questions and answers it was provided with. Instead, they look directly at student learning (which is what really matters).

Each student in the nine schools was invited to participate, and those who agreed were randomised either to receive the AI tutor or not. This randomised controlled trial (RCT) design should give us high confidence in the results. Even though there was selection into the sample, the randomisation happened after that selection, so the results hold at least within the group of students willing to participate (which was 52 percent of eligible students). The results were dramatic:

First, we show that students selected to participate in the program score 0.31 standard deviation higher in the final assessment that was delivered at the end of the intervention. We find strong statistically-significant intent-to-treat (ITT) effects on all sections of that assessment: English skills (which included the majority of questions, 0.24 σ), digital skills (0.14 σ), AI skills (0.31 σ) and an Item Response Theory (IRT) composite score of each student’s exam (0.26 σ). We also show that the intervention yielded strong positive results on the regular English curricular exam of the third term.
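As an aside on the method: effect sizes like these are intent-to-treat estimates, obtained by regressing the standardised endline score on the random assignment indicator (regardless of whether the student actually attended the sessions), usually controlling for the baseline score to improve precision. Here is a minimal sketch in Python, with simulated data and hypothetical variable names (this is not the paper's code or data):

```python
# Minimal sketch of an intent-to-treat (ITT) estimate for an RCT like
# this one. The data are simulated and the variable names hypothetical;
# this is not the paper's code or data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 1000  # hypothetical sample of consenting students

df = pd.DataFrame({
    "assigned": rng.integers(0, 2, n),  # 1 = randomised to the AI tutor
    "baseline": rng.normal(0, 1, n),    # standardised baseline score
})
# Simulate an endline score with a true treatment effect of 0.3
df["endline"] = 0.3 * df["assigned"] + 0.5 * df["baseline"] + rng.normal(0, 1, n)

# ITT: regress the endline score on random assignment (not actual
# attendance), controlling for the baseline score to improve precision
itt = smf.ols("endline ~ assigned + baseline", data=df).fit(cov_type="HC1")
print(itt.params["assigned"])  # recovers ~0.3 with this simulated data
```

With a real dataset, the coefficient on assignment is the ITT effect in standard deviations, directly comparable to the 0.31 σ reported above.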

So not only did the students perform better at the end of the RCT, but that improvement also carried through to a more general exam at the end of the term. It wasn't all good news though, as the results may increase learning inequality:

Treatment effects were positive and statistically significant across all levels of baseline performance, but stronger among students with better prior performance. Similarly, treatment effects were positive and statistically significant over the entire distribution of a proxy for socioeconomic status, but stronger among students with a higher one.

On the other hand:

...treatment effects were stronger among female students, compensating for a deficit in their baseline performance.

Still, finding strong positive effects for all groups of students is an important result, and the narrowing of the gender gap in English capability matters in this context. De Simone et al. then undertake a cost-effectiveness analysis, finding that:

...the program was highly cost-effective. The six-week pilot generated learning gains that take between 1.5 and 2 years in a business-as-usual scenario. The program achieved 3.2 equivalent years of schooling (EYOS) per $100 invested, surpassing many comparable interventions... When benchmarked against evidence from both low- and middle-income countries, the pilot program ranks among the most cost-effective solutions for addressing learning crises.
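In case you are wondering where a figure like '3.2 EYOS per $100' comes from: it is the learning gain converted into equivalent years of schooling, divided by the programme's cost, and scaled to a $100 budget. A back-of-the-envelope sketch, with a purely hypothetical cost-per-student figure (the paper's actual cost data are not reproduced here):

```python
# Rough sketch of the EYOS-per-$100 arithmetic. The learning gain comes
# from the paper; the cost-per-student figure below is hypothetical,
# chosen only to illustrate how the pieces fit together.
eyos_gained_per_student = 1.75   # midpoint of the 1.5-2.0 years reported
cost_per_student_usd = 55.0      # hypothetical programme cost per student

eyos_per_100_usd = eyos_gained_per_student / cost_per_student_usd * 100
print(f"{eyos_per_100_usd:.1f} EYOS per $100")  # ~3.2 with these numbers
```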

It is hard to argue with RCT evidence of such strong positive impacts. However, it is important to remember that context matters. This study was conducted with secondary school students, learning English, in Nigeria. The results are unlikely to generalise to all other learning contexts.

And on the subject of context, this study also made me think about the disciplinary settings of the studies that have shown positive effects of AI tutors as substitutes for human tutors, compared with those that have shown negative (or null) effects. The studies showing positive effects have tended to be in physics, computer science, or (now) English language study. In contrast, the studies with less positive findings have been in law or the social sciences.

Generalising wildly, there is a distinction between the two groups of subject areas. Perhaps an AI tutor works well as a substitute for a human tutor when the subject area consists primarily of problems that are well defined and whose answers can be specified objectively (like a maths problem, or learning a language), but not when the problems are more open-ended and the answers are more subjective? That would make intuitive sense.

AI tutors that have been prompted to follow a tutorial or workshop script (with questions and answers specified in advance) are given boundaries. That will work best when the AI tutor and student don't stray too far off-script. Staying within those boundaries is easier when the questions and answers are well defined and objective.

However, in a tutorial or workshop where the answers are more subjective, the teachers who create the system prompt may find it more difficult to anticipate all of the directions in which the conversation between the student and AI tutor may go. The script may not be able to cover all possibilities, and even if the script is quite detailed, it may be more difficult for the AI tutor (and the student) to stay close to the script. In my experience, the longer the system prompt, the more likely it is that ChatGPT ignores part of the prompt. So, when the question and answer are more subjective, there may be more scope for the AI tutor to introduce irrelevant material, hallucinate, or steer students wrong. That might explain the results from Saturday's post.
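To make the distinction concrete, here is a sketch of what the two kinds of tutor system prompt might look like. These are entirely illustrative prompts of my own construction, not those used in the Nigeria study or in the studies from Saturday's post:

```python
# Entirely illustrative system prompts; not the prompts used in the
# Nigeria study or in the studies from Saturday's post.

# A well-defined task gives the model clear boundaries to stay within.
SCRIPTED_TUTOR_PROMPT = """\
You are an English tutor following this week's lesson on the past perfect.
1. Ask the student to rewrite: 'She finishes her homework before dinner.'
2. The expected answer is: 'She had finished her homework before dinner.'
3. If the student is wrong, hint at the auxiliary verb 'had'; do not
   reveal the answer before the student's third attempt.
Do not discuss topics outside this lesson.
"""

# An open-ended task gives far less to check each model turn against.
OPEN_ENDED_TUTOR_PROMPT = """\
You are a law tutor. Discuss with the student whether the defendant in
this week's case owed a duty of care, and evaluate their reasoning.
"""
```

The scripted version gives the model (and the teacher reviewing transcripts) something checkable at every turn; the open-ended version relies on the model's judgement, which is exactly where drift, irrelevant material, and hallucination have more room to creep in.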

If my speculations above are correct, then that has interesting implications for economics, where the questions that we ask students can easily encompass both the objective and subjective (and economics is not alone in that). Clearly, there is more research to be done here (including my own, but more on that in a future post). Understanding whether AI tutors will work best as a substitute or complement for human tutors (and if the answer is context-dependent, then the best contexts for each) is important for the future of education.

[HT: Ethan Mollick, via Marginal Revolution]
