Friday, 21 October 2016

A cautionary tale on analysing classroom experiments

Back in June I wrote a post about this paper by Tisha Emerson and Linda English (both Baylor University) on classroom experiments. The takeaway message (at least for me) from the Emerson and English paper was that there is such a thing as too much of a good thing - there are diminishing marginal returns to classroom experiments, and the optimal number of experiments in a semester class is between five and eight.

Emerson and English have a companion paper published in the latest issue of the Journal of Economic Education, where they look at additional data from their students over the period 2002-2013 (sorry, I don't see an ungated version anywhere). In this new paper, they slice and dice the data in a number of ways that differ from the AER paper (more on that in a moment). They find:
After controlling for student aptitude, educational background, and other student characteristics, we find a positive, statistically significant relationship between participation in experiments and positive learning. In other words, exposure to the experimental treatment is associated with students answering more questions correctly on the posttest (despite missing the questions initially on the pretest). We find no statistically significant difference between participation in experiments and negative learning (i.e., missing questions on the posttest that were answered correctly on the pretest). These results are consistent with many previous studies that found a positive connection between participation in experiments and overall student achievement.
Except those results aren't actually consistent with other studies, many of which find that classroom experiments have significant positive impacts on overall learning. The problem is the measure "positive learning". This counts the number of TUCE (Test of Understanding of College Economics) questions the students got wrong on the pre-test, but right on the post-test. The authors make a case for positive learning as their preferred measure, rather than the net effect on TUCE scores, but I don't buy it. Most teachers would be interested in the net, overall effect of classroom experiments on learning. If classroom experiments increase students' learning in one area, but reduce it in another, so that the overall effect is zero, then that is the important thing. Which means that "negative learning" (the number of TUCE questions the students got right on the pre-test, but wrong on the post-test) must also be counted. And while Emerson and English find no effect on negative learning, if they run the analysis on the net overall change in TUCE scores (which you can get by subtracting their negative learning measure from their positive learning measure), they find that classroom experiments are statistically insignificant. That is, there is no net effect of classroom experiments on students' TUCE performance.
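To make the three measures concrete, here's a toy sketch of how positive learning, negative learning, and the net change in score relate to each other. The answer patterns are entirely invented - this is just the arithmetic, not the Emerson and English data:

```python
# Hypothetical pre/post answers for one student on a 10-question test
pre  = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # 1 = answered correctly on the pretest
post = [1, 0, 1, 0, 1, 1, 0, 1, 1, 0]  # 1 = answered correctly on the posttest

# 'Positive learning': wrong on the pretest, right on the posttest
positive = sum(1 for a, b in zip(pre, post) if a == 0 and b == 1)
# 'Negative learning': right on the pretest, wrong on the posttest
negative = sum(1 for a, b in zip(pre, post) if a == 1 and b == 0)
# The net change in the total score is just the difference of the two
net = positive - negative

print(positive, negative, net)  # prints: 3 1 2
```

The point of the toy example is that positive learning can look healthy (3 questions gained here) even when some of it is offset by negative learning, which is why the net change is the measure most teachers would care about.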

Next, Emerson and English start to look at the relationship between various individual experiments and TUCE scores (both overall TUCE scores and scores for particular subsets of questions). They essentially run a bunch of regressions, where in each regression the dependent variable (positive or negative learning) is regressed against a dummy variable for participating in a given experiment, plus a bunch of control variables. This part of the analysis is problematic because of the multiple comparisons problem - when you run dozens of regressions, you can expect about one in ten of them to show that your variable of interest is statistically significant (at the 10% level) simply by chance. The more regressions you run, the more of these 'pure chance' statistically significant findings you will observe.
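You can see the multiple comparisons problem directly in a quick simulation. The sketch below (sample sizes and counts are arbitrary choices of mine, not anything from the paper) regresses a pure-noise outcome on a pure-noise regressor many times over, and counts how often the slope comes out 'significant' at the 10% level even though the true effect is zero by construction:

```python
import numpy as np

rng = np.random.default_rng(42)

def slope_t_stat(n=200):
    """OLS t-statistic for the slope in y = a + b*x, where the true b is zero."""
    x = rng.standard_normal(n)  # stand-in for a 'participated in experiment' dummy
    y = rng.standard_normal(n)  # outcome unrelated to x by construction
    xc = x - x.mean()
    b = (xc @ y) / (xc @ xc)
    resid = y - y.mean() - b * xc
    se = np.sqrt((resid @ resid) / (n - 2) / (xc @ xc))
    return b / se

n_regressions = 2000
tstats = np.array([slope_t_stat() for _ in range(n_regressions)])
# 1.645 is (approximately) the two-sided 10% critical value for large samples
false_positives = np.mean(np.abs(tstats) > 1.645)
print(f"share 'significant' at the 10% level: {false_positives:.3f}")
```

Run it and roughly ten percent of the regressions come up 'significant' despite there being nothing to find - which is exactly the worry when a paper reports a grid of per-experiment regressions and highlights the significant cells.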

Now, there are statistical adjustments you can make to the critical t-values to account for multiple comparisons. In the past, I've been as guilty as anyone of not making those adjustments when they may be necessary. At least I'm aware of it though. My rule of thumb in cases where multiple comparisons might be an issue is that if there isn't some pattern to the results, then what you are observing is possibly not real at all, and the results need to be treated with due caution. In this case, there isn't much of a pattern at all, and the experiments that show statistically significant results (especially those that are significant only at the 10% level) may be showing effects that aren't 'real' (in the sense that they may simply be chance results).

So, my conclusion on this new Emerson and English paper is that not all classroom experiments are necessarily good for learning, and the overall impact might be neutral. Some experiments are better than others, so if you are limiting yourself to five (as per my previous post), this new article might help you select the ones that work best (although it would be more helpful if the authors had been more specific about exactly which experiments they were using!).
