The human brain is built (in part) to recognise and act on patterns - it is "one of the most fundamental cognitive skills we possess". Often, we see patterns in what is essentially random noise. Another way of thinking about that is that humans are pretty bad at recognising true randomness (for example, see here or here), and instead perceive bias or patterns in genuinely random processes.
Now, consider a multiple choice test with four options for each question (A, B, C, or D). When most teachers prepare multiple choice tests, they probably aren't thinking about whether the answers will appear sufficiently random to students. That is, they probably aren't thinking about how students will perceive the sequence of answers, and whether that sequence will affect how students answer. That can lead to some interesting consequences. For instance, in a recent semester in ECON100, we had five answers of 'C' in a row (and of the 15 questions, 'C' was the correct answer for eight of them). I didn't even realise we had set that up until I was writing up the answers (just before the students sat the test), and it made me a little worried.
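Out of curiosity (this is my own back-of-the-envelope check, not anything from the literature), it's easy to simulate how often a streak like that turns up when the answer key is genuinely random. A quick Monte Carlo sketch, assuming 15 questions with four equally likely options:

```python
import random

# Quick Monte Carlo check (my assumption: 15 questions, four options,
# each equally likely): how often does a random answer key contain a
# streak of five or more identical answers?
def longest_streak(key):
    best = run = 1
    for prev, cur in zip(key, key[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

random.seed(42)
trials = 100_000
hits = sum(
    longest_streak([random.choice("ABCD") for _ in range(15)]) >= 5
    for _ in range(trials)
)
print(f"P(streak of 5+ in a 15-question key) = {hits / trials:.3f}")
```

In my runs, a streak of five or more of the same letter shows up in only around 3-4 per cent of random 15-question keys - rare, but far from impossible.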
Why was I worried? Consider this 'trembling hand hypothesis': Say that a student is very unsure of the answer, but they think it might be 'C'. But, they are also aware that there are four possible answers, and they believe that the teacher is likely to spread the answers around in a fairly random way. If this student had answered 'C' to the previous question, that might not be a problem. But if they had answered 'C' to the previous four questions, that might cause them to reconsider. Their uncertainty may then cause them to change their answer (or one of their earlier answers that they are unsure of), even though 'C' might be the correct answer (or at least their preferred answer).
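To make the hypothesis concrete, here is a minimal sketch - the accuracy and switching probabilities are entirely made up by me, not estimated from any data. An unsure student privately favours 'C', which is correct with some probability, but the longer their own run of 'C' answers, the more likely they are to switch to another letter:

```python
import random

P_CORRECT = 0.6                                     # assumed: chance their hunch 'C' is right
SWITCH_PROB = {1: 0.0, 2: 0.05, 3: 0.15, 4: 0.35}   # assumed: switching rises with streak length

def expected_mark(streak_length, trials=100_000, seed=1):
    random.seed(seed)
    marks = 0
    for _ in range(trials):
        hunch_right = random.random() < P_CORRECT
        if random.random() < SWITCH_PROB[streak_length]:
            # they switch: only right by accident, picking the correct
            # letter from the remaining three when their hunch was wrong
            marks += (not hunch_right) and random.random() < 1 / 3
        else:
            marks += hunch_right
    return marks / trials

for k in (1, 2, 3, 4):
    print(f"streak of {k}: expected mark on this question = {expected_mark(k):.3f}")
```

With these assumed numbers, staying put earns the student an expected 0.6 marks on the question, while switching earns only about 0.13 (they can only be right by accident), so the expected mark falls steadily as the streak - and the temptation to switch - grows.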
Conversations with students after that ECON100 test with the many 'C' answers suggested to me that it probably didn't cause too many students to change their answers, but it did raise their anxiety levels. However,
a new paper by Hubert János Kiss (Eötvös Loránd University, Hungary) and Adrienn Selei (Regional Centre For Energy Policy Research, Hungary), published in the journal
Education Economics (sorry, I don't see an ungated version online), looks at this in more depth.
Kiss and Selei use data from 153 students who sat one (or more) of five exams at Eötvös Loránd University over a two-week period. All five exams were for the same course (students could choose which exam time they attended, but the exam questions were different at each time). The authors ensured that half of the students in each exam session had an exam paper containing 'streaks' of consecutive questions with the same letter as the correct answer, while the other half had an exam paper with a more 'usual' distribution of answers. They then tested the differences between the two groups, and found:
Treatment has a significant effect at date 1 [for the first exam]. Points obtained in the multiple-choice test are 3 points lower in the treated group even if we control for the other variables. However, at the other dates and when looking at the overall data, treatment has no significant effect.
They then concluded that there was no treatment effect - that is, that the 'streaks' did not affect student performance. However, students in the first exam
were significantly negatively affected (receiving about three fewer marks out of 100). Presumably, students talk to each other, and those in the first exam would have told other students about the unusual pattern of multiple choice answers they found (even though they didn't know the correct answers at that time). So, students in the later exams would probably have been primed not to be caught out by 'streaks' of answers. To be fair, the authors note this:
One may argue that after the first exam, students learned from their peers that streaks were not uncommon, causing the treatment effect to become insignificant later. Unfortunately, we cannot test if this is the case.
Indeed, but it doesn't seem unlikely. Kiss and Selei then go on to test whether students who give a particular letter answer to a question are more (or less) likely to give the same letter answer to the next question, and find that:
In half of the cases, the effect of having an identical previous correct answer (samecorrect1) is not significant at the usual significance levels... In the control treatments, we tend to observe a significant positive effect. Having two identical previous correct answers (samecorrect2) has a consistently negative impact on the probability of giving the correct answer to a given question, and this effect is significant... However, the effect of having three identical previous correct answers (samecorrect3) goes against our expectations, as in the cases where it has a significant effect, this effect is positive!
These results are a little unusual, but I think the problem is in the analysis. There are essentially two effects at work here. First, good students are more likely to get the correct answer, regardless of whether it is the same letter answer as the previous question. Second, students may have a trembling hand when they observe a 'streak' of the same letter answer. The students who are willing to maintain a streak are likely to be the better students, since the not-so-good students eventually switch out of the streak due to the trembling hand (especially if the trembling hand effect increases with the length of the streak). So, it doesn't surprise me at all that observing two previous same letter answers leads students, on average, to switch to an incorrect answer, but that the effect reverses for longer streaks - by then, only the good students remain in the streak, and they are the ones most likely to answer correctly.
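A toy simulation makes this selection story visible - all of the parameters here are my assumptions, not estimates from the paper. Mix 'good' students who usually know the answer with 'weak' students who half-know it and tend to abandon any personal streak of two or more identical answers, then look at accuracy conditional on the length of that streak:

```python
import random

# Toy selection-effect sketch (all numbers assumed): the answer key has a
# run of five 'C's. 'Good' students usually know the answer; 'weak'
# students half-know it and often abandon runs of the same letter.
random.seed(7)

def simulate(n_students=100_000, questions=5):
    stats = {k: [0, 0] for k in range(1, questions)}  # streak length -> [observations, correct]
    for _ in range(n_students):
        good = random.random() < 0.5
        p_know = 0.9 if good else 0.5   # assumed accuracy by student type
        run = 0                         # this student's current run of 'C' answers
        for _ in range(questions):
            answer = "C" if random.random() < p_know else random.choice("ABD")
            # trembling hand (assumed): weak students who have already written
            # 'C' twice in a row switch away 60% of the time
            if not good and answer == "C" and run >= 2 and random.random() < 0.6:
                answer = random.choice("ABD")
            if run >= 1:
                stats[run][0] += 1
                stats[run][1] += answer == "C"
            run = run + 1 if answer == "C" else 0
    return stats

for k, (obs, correct) in sorted(simulate().items()):
    print(f"after a personal streak of {k}: P(correct) = {correct / obs:.3f}")
```

In my runs, conditional accuracy dips after a streak of two (the trembling hand pushing weak students off the correct answer) and then rises for streaks of three or more, once the remaining streakers are mostly the good students - qualitatively the same sign pattern as Kiss and Selei's samecorrect2 and samecorrect3 results.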
The authors control for student quality using students' results on the essay questions, but that only adjusts for average (mean) differences between good and not-so-good students. It doesn't test whether the 'streaks' have different effects on students of different quality. In any case, their sample size is probably too small to detect any difference in these effects.
All of which still leaves me wondering: should we worry about non-randomness in multiple choice answers? We'll have to wait for a more robust study to answer that question. In the meantime, I'll make sure to check the distribution of answers to the ECONS101 multiple choice questions. Just in case.