Wednesday, 30 October 2024

Some notes on generative AI and assessment (in higher education)

Last week, I posted some notes on generative AI in higher education, focusing on positive uses of generative AI for academic staff. Today, I want to follow up with a few notes on generative AI and assessment, based on some notes I made for a discussion at the School of Psychological and Social Sciences this afternoon. That discussion quickly evolved into a discussion of intentional assessment design more generally, rather than focusing on the risks of generative AI to assessment specifically. That's probably a good thing. Any time academic staff are thinking more intentionally about assessment design, the outcomes are likely to be better for students (and for staff as well).

Anyway, here are a few notes that I made. Most importantly, the impact of generative AI on assessment, and the robustness of any particular item of assessment to generative AI, depends on context. As I see it, there are three main elements of the context of assessment that matter most.

First, assessment can be formative or summative (see here, for example). The purpose of formative assessment is to promote student learning, and provide actionable feedback that students can use to improve. Formative assessment is typically low stakes, and the size and scope of any assessment item is usually quite small. Generative AI diminishes the potential for learning from formative assessment. If students are outsourcing (part of) their assessment to generative AI, then they aren't benefiting from the feedback or the opportunity for learning that this type of assessment provides.

Summative assessment, in contrast, is designed to evaluate learning, distinguish good students from not-so-good students from failing students, and award grades. Summative assessment is typically high stakes, with a larger size and scope of assessment than formative assessment. Generative AI is a problem in summative assessment because it may diminish the validity of the assessment, in terms of its ability to distinguish between good students and not-so-good students, or between not-so-good students and failing students, or (worst of all) between good students and failing students.

Second, the level of skills that are assessed is important. In this context, I am a fan of Bloom's taxonomy (which has many critics, but in my view still captures the key idea that there is a hierarchy of skills that students develop over the course of their studies). In Bloom's taxonomy, the 'cognitive domain' of learning objectives is separated into six levels (from lowest to highest): (1) Knowledge; (2) Comprehension; (3) Application; (4) Analysis; (5) Synthesis; and (6) Evaluation.

Typically, first-year papers (like ECONS101 or ECONS102, which I teach) predominantly assess skills and learning objectives in the first four levels. Senior undergraduate papers mostly assess skills and learning objectives in the last three levels. Teachers might expect generative AI to be better at the lower levels - things like definitions, classification, and the understanding and application of simple theories, models, and techniques. And indeed, it is. Teachers might also hope that generative AI is less good at the higher levels - things like synthesising papers, evaluating arguments, and presenting its own arguments. Unfortunately, it appears that generative AI is also good at those skills. However, context does matter. In my experience (and this is subject to change, because generative AI models are improving rapidly), generative AI can mimic the ability of even good students at tasks at the low levels of Bloom's taxonomy, which means that tasks at that end lack any robustness to generative AI. At tasks higher up Bloom's taxonomy, however, generative AI can mimic the ability of failing and not-so-good students, but is still outperformed by good students. So, many assessments like essays or assignments that require higher-level skills may still be a robust way of identifying the top students, but will be much less useful for distinguishing between students who are failing and students who are not-so-good.

Third, authenticity of assessment matters. Authentic assessment (see here, for example) is assessment that requires students to apply their knowledge in a real-world, contextualised task. Writing a report or a policy brief is a more authentic assessment than answering a series of workbook problems, for example. Teachers might hope that authentic assessment would engage students more, and reduce the use of generative AI. I am quite sure that many students are more engaged when assessment is authentic. I am less sure that generative AI is used less when assessment is authentic. And, despite teachers' hopes, generative AI is just as good at an authentic assessment as it is at other assessments. In fact, it might be better. Consider the example of a report or a policy brief. The training datasets of generative AI models no doubt contain lots of reports and policy briefs, so these models have plenty of experience with exactly the types of tasks we might ask students to complete in an authentic assessment.

So, given these contextual factors, what types of assessment are robust to generative AI? I hate to say it, and I'm sure many people will disagree, but in-person assessment cannot be beaten in terms of robustness to generative AI. In-person tests and examinations, in-person presentations, in-class exercises, class participation or contributions, and so on, are assessment types where it is not impossible for generative AI to play a role, but where it is certainly very difficult for it to do so. Oral examinations are probably the most robust of all. It is impossible to hide your lack of knowledge in a conversation with your teacher. This is why universities often use oral examinations at the end of a PhD.

In-person assessment is valid for formative and summative assessment (although the specific assessments used will vary). It is valid at all levels of learning objectives that students are expected to meet. It is valid regardless of whether assessment is authentic or not. Yes, in case it's not clear, I am advocating for more in-person assessment.

After in-person assessment, I think the next best option is video assessment. But perhaps not for long. Using generative AI to create a video avatar to attend Zoom tutorials, or to make a presentation, is already possible (HeyGen is one example of this). In the meantime, though, video reflections (as I use in ECONS101), interactive online tutorials or workshops, online presentations, and question-and-answer sessions are all valid assessments that are somewhat robust to generative AI.

Next are group assessments, like group projects, group assignments, or group video presentations. The reason I believe group assessments are somewhat robust is that it takes a certain amount of group cohesion to make a sustained effort at 'cheating'. I don't believe that most groups formed within a single class are cohesive enough to maintain this (although I am probably too hopeful here!). Of course, there will be cases where just one group member's contribution to a larger project was created with generative AI, but it would generally take the entire group to undermine the assessment as a whole. When generative AI for video becomes more widespread, group assessments will become a more robust assessment option than video assessment.

Next are long-form written assessments, like essays. I'm not a fan of essays, as I don't think they are authentic as assessments, and I don't think they assess skills that most students are likely to use in the real world (unless they are going on to graduate study). However, they might still be a valid way of distinguishing between good students and not-so-good students. To see why, read this New Yorker article by Cal Newport. Among other issues, the short context window of most generative AI models means that they are not great at long-form writing, at least compared with shorter pieces. However, generative AI's shortcomings here will not last, and that's why I've ranked long-form writing so low.

Finally, online tests, quizzes, and the like should no longer be used for assessment. The development of browser plug-ins that can answer multiple-choice, true/false, fill-in-the-blanks, and short-answer questions automatically, with minimal student input (other than perhaps hitting the 'submit' button), makes these types of assessment invalid. Any attempt to thwart generative AI in this space (and I've seen things like using hidden text, using pictures rather than text, and other similar workarounds) is at best an arms race. Best to get out of that now, rather than wasting lots of time trying (and generally failing) to stay one step ahead of the generative AI tools.

I also know that many of my colleagues have become attracted to getting students to use generative AI in assessment. This is the "if you can't beat them, join them" solution to generative AI's impact on assessment. I am not convinced that it is a solution, for two reasons.

First, as is well recognised, generative AI has a tendency to hallucinate. Users know this, and can recognise when a generative AI model has hallucinated in a domain in which they have specific knowledge. If students, who are supposed to be developing their own knowledge, are asked to use or work with generative AI in their assessment, at what point will those students develop the knowledge they need to recognise when the generative AI tool they are working with is hallucinating? Critical thinking is an important skill for students to develop, but being critical of generative AI output often requires the application of domain-specific knowledge. So, at the least, I wouldn't like to see students encouraged to work with generative AI until they have a lot of the basics (skills that are low on Bloom's taxonomy) nailed down first. Let generative AI help them with analysis, synthesis, or evaluation, while the students' own skills in knowledge, comprehension, and application allow them to identify generative AI hallucinations.

Second, the specific implementations of assessments that involve students working with generative AI are often not well thought through. One common example I have seen is to give students a passage of text written by a generative AI model in response to some prompt, and ask the students to critique the AI response. I wonder, in that case, what stops the students from simply asking a different generative AI model to critique the first model's passage of text?

There are good examples of getting students to work with generative AI, though. One involves asking students to write a prompt, retrieve the generative AI output, and then engage in a conversation with the generative AI model to improve the output, finally constructing an answer that combines the generative AI output and the student's own ideas. The student then submits this final answer, along with the entire transcript of their conversation with the generative AI model. This type of assessment has the advantage of being very authentic, because this is likely how most working people engage with generative AI for completing work tasks (I know that it's one of the ways that I engage with generative AI). Of course, it is then more work for the marker to look at both the answer and the transcript that led to that answer. But then again, as I noted in last week's post, generative AI may be able to help with the marking!

You can see that I'm trying to finish this post on a positive note. Generative AI is not all bad for assessment. It does create challenges. Those challenges are not insurmountable (unless you are offering purely online education, in which case good luck to you!). And it may be that generative AI can be used in sensible ways to assist in students' learning (as I noted last week), as well as in students completing assessment. However, we first need to ensure that students are given adequate opportunity to develop a grounding on which they can apply critical thinking skills to the output of generative AI models.

[HT: Devon Polaschek for the New Yorker article]
