Late last year, an article in the American Economic Review by Ryan Oprea caught my attention (and I blogged about it here). It purported to show that the key experimental results underlying Prospect Theory may in part be driven by the complexity of the experiments used to test them. These were extraordinary results. And when you publish a paper with extraordinary results, ones that could potentially overturn a large literature on a particular theory, those results are going to attract substantial scrutiny. And indeed, that is what has happened with Oprea's paper.
The team at DataColada, best known for exposing the data fakery of Dan Ariely and Francesca Gino (and for the resulting lawsuit, which was dismissed), have a new working paper, authored by Daniel Banki (ESADE Business School) and co-authors, looking at Oprea's results (see also the blog post on DataColada by Uri Simonsohn, one of the co-authors). To be clear, before I discuss Banki et al.'s critique: they don't accuse Oprea of any misconduct. They mostly present an alternative view of the data and results that appears to contradict key conclusions in Oprea's paper. Oprea has also provided a response to some of their critique.
I'm not going to summarise Oprea's original paper in detail, as you can read my comments on it here. However, the key result in the paper is that when presented with risky choices, research participants' behaviour was consistent with Prospect Theory, and when presented with choices that involved no risk at all but were complex in a similar way to the risky choices ('deterministic mirrors'), research participants' behaviour was also consistent with Prospect Theory. This suggests that a large part of the observed results that underlie Prospect Theory may arise because of the complexity of the choice tasks that research participants are presented with.
Banki et al. look at a number of 'comprehension questions' that Oprea presented research participants with, and note that:
...75% of participants made an error on at least one of the comprehension questions, such as erroneously indicating that the riskless mirror had risk.
Once the data from those research participants are excluded, Banki et al. show that behaviour differs between lotteries and mirrors for those who 'passed' the comprehension checks (by getting all four comprehension questions correct on their first try). This is captured in Figure 2 from Banki et al.'s paper:
The two panels on the left of Figure 2 show the results for the full sample, and notice that both lotteries (top panel) and mirrors (bottom panel) look similar in terms of results. In contrast, when the sample is restricted to those who 'passed' the comprehension checks, the results for lotteries and mirrors look very different. That is what we would expect if research participants are not 'fooled' by the complexity of the task.
Banki et al. provide a compelling reason why the results for the research participants who failed the comprehension checks look the same for lotteries and mirrors: regression to the mean. As Simonsohn explains in the DataColada blog post, this arises because of the way that a multiple-price list works:
When the dependent variable is how much people value prospects, regression to the mean creates spurious evidence in line with prospect theory. When people answer randomly for 10% chance of $25, they overvalue it, because the “right” valuation is $2.50, and the scale mostly contains values that are higher than that. When people answer randomly for 90% chance of $25, they undervalue it, because the “right” valuation is $22.50 and the scale mostly contains values that are lower than that. Thus, random or careless responding will produce the same pattern predicted by prospect theory.
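Simonsohn's point can be illustrated with a short simulation. This is a minimal sketch, not Oprea's actual design: I assume a hypothetical multiple-price list where certainty equivalents for a $25 prize are elicited on an evenly spaced scale from $0 to $25, and a careless participant simply picks a row at random.

```python
import random

random.seed(42)

# Hypothetical multiple-price list: certainty equivalents for a $25 prize,
# elicited on an evenly spaced scale from $0.00 to $25.00 (21 rows).
scale = [1.25 * i for i in range(21)]

def random_valuation():
    """A careless participant picks a switch point uniformly at random."""
    return random.choice(scale)

n = 10_000
responses = [random_valuation() for _ in range(n)]
mean_response = sum(responses) / n  # close to $12.50, the centre of the scale

ev_10 = 0.10 * 25  # $2.50 - the "right" valuation of a 10% chance of $25
ev_90 = 0.90 * 25  # $22.50 - the "right" valuation of a 90% chance of $25

print(f"mean random response: ${mean_response:.2f}")
print(f"10% prospect: random responders overvalue by ${mean_response - ev_10:.2f}")
print(f"90% prospect: random responders undervalue by ${ev_90 - mean_response:.2f}")
```

Because random responses cluster around the middle of the scale, they sit above the expected value of the 10% prospect and below the expected value of the 90% prospect, mimicking the overweighting of small probabilities and underweighting of large probabilities that Prospect Theory predicts.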
Oprea responds to both of these points, noting that:
...a range of imperfectly rational behaviors including noisy valuations, anchoring-and-adjustment heuristics, compromise heuristics and pull-to-the-center heuristics will all tend to produce prospect-theoretic patterns of behavior simply because of the nature of valuation. BSWW offer this possibility as an alternative to the Oprea (2024)’s account of his data, but in fact these are examples of exactly the types of cognitive shortcuts Oprea (2024) was designed to study.
In other words, Banki et al.'s results don't refute Oprea's results, but are very much in line with Oprea's. One thing that Oprea does take issue with is Banki et al.'s use of medians as the preferred measure of central tendency. Oprea uses the mean, and when reanalysing the data with the same exclusions as Banki et al., Oprea shows that the mean results look similar to the original paper. So, Banki et al.'s results are not simply driven by excluding the research participants who failed the comprehension checks, but also by switching from using the mean to using the median.
On that point, I'm inclined to agree with Banki et al. The median is often used in experimental economics because it is less influenced by outliers. And if you look at Oprea's data, there are a lot of large outliers, which become quite influential observations when the mean is used as the summary statistic. However, the outliers are likely to be the observations you want to have the smallest effect on your results, not the largest effect.
Oprea also critiques Banki et al.'s interpretation of the comprehension questions. Oprea rightly notes that:
...it is important to emphasize that these training questions weren’t designed to measure beliefs (e.g., payoff confusion), and because of this they are poorly suited to the task BSWW repurpose it for, ex post. Indeed, evidence from the patterns of mistakes made in these questions suggests that overall training errors largely serve as a measure of the cognitive effort (an important ingredient in Oprea (2024)’s account) subjects apply to answering these questions, and that BSWW therefore substantially overestimate the level of payoff confusion with which subjects entered the experiment.
In other words, the 'comprehension questions' are not comprehension questions at all, but are really 'training questions' that were used to train the research participants to understand the choice tasks they would be presented with. And so, using those training questions as a measure of understanding misses the point, and seriously underestimates how well research participants understood the task by the time they had completed the training questions.
Oprea's response is good on this point. However, if the training questions had really done a good job of training the research participants, then all participants should have had a similar level of understanding by the end of the training questions, and there should be no detectable differences in behaviour between those with more, and those with fewer, 'failed' training questions. That wasn't the case - the behaviour of the research participants who made errors in training was much more likely to be the same for lotteries and mirrors than was the behaviour of research participants who made no errors. To clear this up, it would have been interesting to have research participants also complete 'comprehension questions' at the end of the experimental session, to see if they still understood the tasks they were being asked to complete. At that point, those failing the comprehension questions could be dropped from the dataset.
One point of Banki et al.'s critique that Oprea hasn't engaged with (yet, although he promises to do so in a future, more complete response), is their finding that a larger than 'usual' proportion of the research participants fail 'first order stochastic dominance' (FOSD). A failure of FOSD in this context means that a research participant valued a lottery (or mirror) lower than a similar lottery that was strictly better. For example, valuing a 90% chance of receiving $25 less than a 10% chance of receiving $25 is a failure of FOSD. Banki et al. show that:
We begin by examining G10 and G90. Violating FOSD here involves valuing the 10% prospect strictly more than the 90% one. Across all participants (N = 583), 14.8% violated FOSD for mirrors, and 13.9% for lotteries. These rates are quite high given that the prospects differ in expected value by a factor of nine.
Those failure rates are much higher than for other similar research studies. Banki et al. note an overall rate of 20.8 percent in the Oprea results, compared with an average of 3.4 percent across eight other highly cited studies. It will be interesting to see how Oprea responds to that point in the future.
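The FOSD check that Banki et al. apply is mechanically simple, and can be sketched as follows. The valuations below are made-up numbers for illustration, not data from the paper; only the G10/G90 comparison itself (a 10% versus a 90% chance of $25) comes from the quoted passage.

```python
def violates_fosd(value_low_p, value_high_p):
    """FOSD violation: valuing the strictly worse prospect (10% chance of $25)
    strictly above the strictly better one (90% chance of $25)."""
    return value_low_p > value_high_p

# Hypothetical certainty equivalents from three illustrative participants.
valuations = [
    {"id": 1, "G10": 5.00, "G90": 20.00},   # consistent with FOSD
    {"id": 2, "G10": 15.00, "G90": 10.00},  # violates FOSD
    {"id": 3, "G10": 2.50, "G90": 22.50},   # consistent with FOSD
]

violation_rate = sum(
    violates_fosd(v["G10"], v["G90"]) for v in valuations
) / len(valuations)
print(f"FOSD violation rate: {violation_rate:.1%}")  # → FOSD violation rate: 33.3%
```

Applied to the real data, it is this rate, computed across all participants, that Banki et al. report as unusually high compared with other studies.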
This is an interesting debate so far. Oprea does a good job of summing up where this debate should probably go next:
Ultimately, however, these questions and ambiguities can only be fully resolved by further research. While BSWW’s critique has not convinced me that the interpretation offered in Oprea (2024) is mistaken, I am eager to see new experiments that deepen, alter, or even overturn this interpretation. First, concerns that the Oprea (2024)’s results are a consequence of the design being too confusing to yield insight can only really be resolved one way or another by followup experiments that vary his procedures, instructions and other design choices in such a way as to satisfy us that the Oprea (2024) results are (or are not) overfit to that design.
Indeed, more follow-up research is needed. Prospect Theory hasn't been overturned, yet (and as I noted in my earlier post, it is consistent with a lot of real-world behaviour). However, now we know that it may be vulnerable, and Oprea's paper provides a starting point for testing more thoroughly how much of the experimental results arise from complexity.
[HT: Riccardo Scarpa]