I'm becoming more convinced over time that there are good approaches to using large language models (LLMs) as tutors, and not-so-good approaches. That statement seems kind of obvious, so let me unpack it a little bit.
There are two main ways that teachers appear to be using LLMs as tutors. The first approach is to give the LLM tutorial or workshop material (or problem sets, or questions), and prompt the LLM tutor to walk the student through the tutorial or workshop (or whatever). The LLM may be prompted to act as a Socratic tutor, encouraging the student to answer questions rather than providing answers directly. This is the style of tutor employed in the paper I discussed in this post (as one example). Most teachers do seem to be approaching it this way, because it is most often the way that we expect human tutors to engage with students. LLM tutors set up in this way are substitutes for human tutors.
The second approach to using an LLM as a tutor is to set it up as a source of answers for student queries. The LLM has a basic prompt, as well as access to the teaching materials; students ask questions, and the LLM provides answers, additional context, or examples. This is how we have set up Harriet, our ECONS101 AI tutor. These LLM tutors are not trying to mimic a human tutor, but are essentially a reference textbook that students can query and converse with. Indeed, this is probably what the future of textbook publishing will look like, with textbooks replaced by chatbots. LLM tutors set up in this way are complements for human tutors.
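To make that second approach a little more concrete, here is a minimal sketch of what a 'queryable textbook' tutor can look like. This is not Harriet's actual implementation; the prompt wording, the model choice, and the econs101_notes.txt file are my own assumptions for illustration, and the sketch assumes the OpenAI Python SDK with an API key in the environment.

```python
# A minimal 'queryable textbook' tutor (illustrative sketch only, not Harriet's code)
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The teaching materials are simply pasted into the system prompt. A retrieval step
# could replace this if the materials are too long for the model's context window.
course_notes = open("econs101_notes.txt").read()

system_prompt = (
    "You are a tutor for an introductory economics paper. Answer student questions "
    "using the course materials below. Provide answers, additional context, and "
    "examples; you are a reference that students can query, not a Socratic questioner.\n\n"
    f"COURSE MATERIALS:\n{course_notes}"
)

def ask_tutor(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # model choice is illustrative only
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask_tutor("Can you give me another example of a negative externality?"))
```

The key design choice is in the prompt: the tutor is told to answer directly, rather than to withhold answers and question the student.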
On the surface, either approach to employing LLM tutors is valid. Teachers with a strong grounding in the education literature are likely to prefer the first approach. However, in my view, LLM tutors are not human tutors, and we shouldn't expect students to engage with LLM tutors in the same way that they engage with human tutors. At this stage, I believe that LLM tutors might be better as complements to human tutoring, rather than substitutes (which is why Harriet is set up the way that she is).
Now, some new evidence suggests that LLM tutors, when set up as substitutes, may fall short in important ways. This article by Armin Alimardani (University of Wollongong) and Emma Jane (University of New South Wales Sydney), forthcoming in the journal Law, Technology and Humans (open access, with a non-technical summary on The Conversation), looks at the use of a ChatGPT tutor in several law classes at the University of Wollongong in Australia. This research is interesting, because the initial implementation of the AI tutor predates the release of ChatGPT (they used a pre-release version of GPT-4). However, Alimardani and Jane also report on results from more recent versions of ChatGPT. In their initial implementation:
...we created SmartTest, an educational chatbot with features designed to align with educational objectives. The chatbot allows educators to pose pre-determined questions to students, ensuring that the topics and focus areas match learning goals. Feedback is generated based on educator-drafted answers to provide a coherent learning path...
The system prompts in the test cycles included three main sections: (1) Instructions—SmartTest’s role and the steps it should take when interacting with students. For example, if a student asks an unrelated question, SmartTest refuses to answer and redirects them to the test cycle question; (2) The Questions; and (3) The Answer Guide.
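The article doesn't reproduce the full prompt, but based on that description, a stripped-down SmartTest-style setup might look something like the sketch below. The wording, variable names, and single-turn structure are my own guesses, not the authors' code.

```python
# A hypothetical sketch of a SmartTest-style prompt (not the authors' actual implementation)
from openai import OpenAI

client = OpenAI()

# (1) Instructions: the tutor's role, and how it should handle unrelated questions
INSTRUCTIONS = (
    "You are SmartTest, a tutor for a law class. Pose the question below to the student, "
    "then give feedback on their answer based only on the answer guide. If the student "
    "asks an unrelated question, refuse to answer and redirect them back to the question."
)
# (2) The question, and (3) the answer guide, both drafted by the educator
QUESTION = "An educator-drafted problem scenario or short-answer question goes here."
ANSWER_GUIDE = "The educator-drafted model answer and marking notes go here."

system_prompt = f"{INSTRUCTIONS}\n\nQUESTION:\n{QUESTION}\n\nANSWER GUIDE:\n{ANSWER_GUIDE}"

def give_feedback(student_answer: str) -> str:
    """One turn of the conversation: the student submits an attempt and gets feedback."""
    response = client.chat.completions.create(
        model="gpt-4",  # the study's initial implementation used a GPT-4 model
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": student_answer},
        ],
    )
    return response.choices[0].message.content
```

In contrast to the setup sketched earlier, this tutor is told to withhold the answer and give feedback on the student's attempt instead.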
Essentially, this is GPT as a substitute for a human tutor. Alimardani and Jane are able to see all interactions between the LLM tutor and students, which provides a wealth of data, even though the actual number of students is relatively small (fewer than 50 in the second semester of 2023). It also appears that SmartTest was available for use during class time as part of a regular tutorial (although this point is not entirely clear from the article). Anyway, Alimardani and Jane made SmartTest available to students across five 'test cycles', each two weeks long. Looking at the conversations, they:
...reviewed SmartTest’s interactions with students to determine the rate of erroneous feedback. Errors were defined as situations in which SmartTest provided incorrect information, failed to adequately correct a student’s inaccurate or partial response, or provided confusing feedback. Errors included cases where a student gave an incorrect or partial response, and SmartTest replied with a positive statement (e.g., ‘That’s correct!’), before presenting the correct answer.
Alimardani and Jane find that:
...during the initial three test cycles, which featured short problem scenarios, the percentage of conversations containing at least one error ranged from 39.5% to 53.5%. By contrast, in Cycles 4 and 5, which involved short-answer questions, the error rate dropped to a range of 6.3% to 26.9%. This variation in performance is partly explained by the difference in the complexity of questions and answer guides in earlier cycles.
It is important to note that the questions that SmartTest asked students were much shorter and less detailed in the later cycles. Even so, the error rates are quite worrying, especially when you consider that SmartTest had access to the answer guide!
Alimardani and Jane then tested more recent iterations of ChatGPT (ChatGPT-4o, GPT-4-Turbo, o1, o1-mini, o3-mini, and GPT-4.5), not with students, but by pre-seeding the models with some of the conversations where the GPT-4 version of SmartTest had made errors (I sketch below what this kind of replay evaluation might look like). They find that:
On average, GPT-4o and GPT-4.5 achieved similar scores and outperformed the other models. Notably, the average performance of the reasoning models (o1, o1-mini and o3-mini) was worse than the other models.
The more recent models did not outperform GPT-4, and in fact the more complex reasoning models did worse! That result was a real surprise to me, especially when you consider that LLM performance is trending upwards over time. Perhaps the more recent LLMs are more willing to deviate from the details in the prompt (which includes the questions and the answers to the tutorial problems)? Alimardani and Jane aren't able to provide a satisfactory answer as to why the latest models perform worse, or a concrete solution for that problem.
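For what it's worth, a replay evaluation of that sort could be sketched roughly as follows. This is my own illustration of the general idea, not the authors' evaluation code: the transcript file format, the model list, and the scoring step are all assumptions.

```python
# A rough sketch of replaying stored tutor conversations against newer models
# (my own illustration of the general idea, not the authors' evaluation code)
import json
from openai import OpenAI

client = OpenAI()

# Assume each stored conversation is a dict with an 'id' and a 'messages' list
# (system, user, and assistant turns), cut off just before the original tutor's error.
conversations = json.load(open("error_conversations.json"))

models = ["gpt-4o", "gpt-4-turbo", "o3-mini"]  # an illustrative subset of models

for model in models:
    for convo in conversations:
        # Note: some reasoning models handle system prompts differently; this sketch
        # ignores that detail for simplicity.
        response = client.chat.completions.create(model=model, messages=convo["messages"])
        reply = response.choices[0].message.content
        # The replies would then need to be graded against the answer guide;
        # the paper's scoring scheme is not reproduced here.
        print(model, convo["id"], reply[:80])
```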
Despite the errors, LLM tutors might still have value if the students find them helpful. Remember, human tutors are not perfect either! Alimardani and Jane:
...asked students to identify the most appealing features of SmartTest. The feature that received the most votes (n=42) was the instant feedback provided by the chatbot. This was followed by two features, each receiving 28 votes: the conversational format, which allows students to break down their answers and receive feedback on each component separately, and the option to express uncertainty to receive guidance...
Despite identifying instant feedback from SmartTest as its most appealing feature, students still preferred feedback from human tutors. Alimardani and Jane:
...asked students to rank their preferred mode of feedback to explore whether the delay in receiving feedback from tutors would influence students’ preference for SmartTest. We found that 51.52% (17 students) selected receiving feedback from their tutor on their learning management system (LMS) with a delay of one or more days as their top choice. In contrast, only 27.27% (9 students) preferred using SmartTest for feedback as their first option.
These results suggest that students themselves don't prefer LLM tutors as substitutes for human tutors. That would also help explain the University of Auckland's recent experience with an AI tutor in marketing. Future versions of LLMs may do a better job as a substitute for human tutors (although Alimardani and Jane's results should make us a little wary about that). However, for now, an LLM tutor may be best employed as a complement to human tutoring.
[HT: The Conversation]