In this post over the weekend, I discussed a paper that looked at the performance of ChatGPT 3.5 in answering economics homework questions (multiple choice, and long answer). Obviously, lots of teachers are interested in how well large language models perform in answering questions, and lots of researchers (who are also teachers) are investigating the performance of large language models as tutors, and their effects on student learning (including me - more on that in a future post).
Another paper that looks at how well ChatGPT performs as an economics tutor is this working paper by Natalie Bröse (University of Applied Sciences Bonn-Rhein-Sieg), Christian Spielmann (University of Bristol), and Christian Tode (University of Applied Sciences Bonn-Rhein-Sieg). In the paper, Bröse et al.:
...analyze the performance of three ChatGPT models (GPT-3.5, GPT-4o, and o1preview) in two key tasks: explaining economic concepts and answering multiple-choice questions with explanations. We use CORE Econ’s The Economy 1.0 (Bowles, Carlin, and Stevens 2017) as our reference material and evaluate ChatGPT’s responses across multiple dimensions relevant to student learning.
It is interesting (to me) that Bröse et al. use the CORE Econ text, because that is the text we use as a base for my ECONS101 paper (although we do not follow it particularly closely). They also undertake quite a detailed investigation of the quality and characteristics of ChatGPT's responses to questions about economic concepts, as well as the characteristics of multiple-choice questions that ChatGPT gets right or wrong. This adds a lot of depth to the analysis, and is really helpful for those of us who are training or testing our own economics AI tutors, because it highlights some areas where even the most recent iterations of ChatGPT (like o1-preview) still seem to struggle.
When looking at the explanation of economic concepts, Bröse et al. find that:
the inclusion of inaccurate or misleading information in at least an important part of the concept... occur in 28.6% of GPT-3.5 outputs, 17.8% of GPT-4o outputs, and 12.5% of o1preview outputs.
It will depend on your priors as to how good or bad you think those results are. I think the trend in improvement across versions of ChatGPT is instructive, both in terms of the progress over time and what we might expect from more recent versions. Looking at the types of errors that are made, Bröse et al. find that:
...over 75% of the errors across all models are factual errors. This means that information provided is incorrect based on established facts, either because models fail to retrieve accurate information from their training data, or facts are incorrectly assimilated during the learning process.
That wouldn't be so bad, except that:
...inaccurate responses are often open-domain (i.e., users cannot identify incorrect information without extensive research outside of the chat) in 76.67% of cases for GPT-3.5, 85.19% for GPT-4o, and 64.7% for o1preview. This changes slightly when focusing solely on responses with an accuracy score of 3 or lower... inaccurate responses are of the open-domain in 76.47% of cases for GPT-3.5, 91.91% for GPT-4o, and 50% for o1preview.
So, when ChatGPT makes errors in explaining concepts, it often does so in a way that users cannot easily tell that an error has been made. To a large extent, that isn't surprising. If the error was obvious without extensive research outside the chat, the chances are that ChatGPT would not have made the error in the first place. However, it also highlights that it still pays for users to double-check ChatGPT's responses to questions like this, when they are not sure. In fact, I can report that several of my ECONS101 students this trimester have queried Harriet's responses to a question, and she has immediately realised that she made an error! There is good learning for both the student and the AI tutor from such an experience (or, at least, for the student, because Harriet won't remember the interaction later - I just update her knowledge base to try to minimise future errors of the same type).
Bröse et al. then go on to report that ChatGPT generally gives a clear and accessible explanation, but that the responses often "lack important detail or nuance" (which they term 'scope'). That doesn't surprise me for a one-shot response to a question, and I think that Bröse et al. are being a little unfair in their assessment of ChatGPT on the scope dimension. If the user wants additional detail or nuance, then they really should ask additional follow-up questions, rather than expecting a detailed and nuanced treatise on a topic in a one-shot query.
Bröse et al. also note that, in relation to the quality of examples provided by ChatGPT:
66.07% of responses from GPT-3.5, 73.21% from GPT-4o, and 51.79% from o1preview received a score of 3 or lower, indicating that the examples provided were weak, unhelpful, or even detrimental to comprehension. An additional 25% of GPT-3.5 responses, 17.86% of GPT-4o responses, and 42.86% of o1preview responses attained a score of 4, suggesting that the examples were relevant but overly simplistic.
That has been my experience with ChatGPT as well, and looking at transcripts of students' conversations with Harriet (for those students who have shared them with me), it does seem to be a general trend. However, again, using a simplistic example as a starting point and then engaging in a more detailed follow-up conversation does generally lead to better results.
Turning to multiple choice questions, Bröse et al. find that:
GPT-3.5 correctly assessed 67% of the options, GPT-4o achieved 91%, and o1preview reached 93%.
This provides further evidence, if any were needed, that online multiple-choice tests are not going to be an effective assessment tool ever again. Unless the goal is to hand out some free marks to students, of course.
Bröse et al. finish their paper by offering some advice for teachers:
- Integrate ChatGPT into economics courses by leveraging its strengths in explanations and question-answering while educating students on its capabilities and limitations...
- Teach students to use ChatGPT as a learning aid, not a replacement for textbooks, lectures, or their own effort in solving problems...
- Acknowledge that students will struggle to detect incorrect output...
- Shift classroom focus to application-based learning by leveraging a flipped classroom model and providing clear, detailed examples...
- Guide students to specify the question context as well as the chatbot’s role when prompting...
- Refine your problem statements for easier comprehension...
- Verify translation accuracy before using ChatGPT in Non-English courses.
The last piece of advice is interesting because they also evaluated ChatGPT in German, and generally found that ChatGPT started by translating the query into English, answering it in English, and then translating the answer back into German. Some of their other advice can be taken out of students' hands by the teacher creating their own AI tutor (as I have done). Then, prompting and problem statements become less of an issue.
Overall, this paper is a good reminder of how much improvement there has been in large language models in just the last couple of years (ChatGPT-3.5 was released in November 2022). Even o1-preview is nearly a year old now, and has been supplanted. However, we need to remain mindful that large language models do convey an aura of certainty and accuracy in their responses that is not always warranted. At the very least, when in doubt, ask ChatGPT to check its answers!