In a famous experiment, Daniel Simons and Christopher Chabris tested people's selective attention, asking them to count the number of times a basketball is passed. If you haven't seen or heard of this test, you should try it out for yourself:
Did you see the gorilla? Some years ago, I was shown a different video with the same premise, and I totally missed it. The whole point is that, when we are very focused on a particular task, we can totally miss other important things that are going on.
That brings me to this 2020 article by Itai Yanai (NYU Langone Health) and Martin Lercher (Heinrich Heine University), published in the journal Genome Biology (open access). Yanai and Lercher gave students two datasets, each of which contained data on body mass index (BMI) and the number of steps taken each day. One dataset covered women, and the other covered men. Yanai and Lercher then placed the students into two groups, and gave the two groups different instructions:
The students in the first group were asked to consider three specific hypotheses: (i) that there is a statistically significant difference in the average number of steps taken by men and women, (ii) that there is a negative correlation between the number of steps and the BMI for women, and (iii) that this correlation is positive for men. They were also asked if there was anything else they could conclude from the dataset. In the second, “hypothesis-free,” group, students were simply asked: What do you conclude from the dataset?
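As a rough sketch of what the first group's task amounts to, here is how those three hypotheses might be checked in Python. The data below are invented for illustration (the paper's actual dataset is not reproduced here), and the use of a t-test and Pearson correlations is my assumption about a natural way to operationalise the three hypotheses:

```python
# Illustrative sketch only: synthetic stand-ins for the two datasets,
# not the data from Yanai and Lercher's experiment.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical (steps, BMI) data: a built-in negative slope for women
# and a positive slope for men, matching hypotheses (ii) and (iii).
steps_w = rng.normal(7000, 1500, 100)
bmi_w = 30 - 0.001 * steps_w + rng.normal(0, 1, 100)
steps_m = rng.normal(6000, 1500, 100)
bmi_m = 20 + 0.001 * steps_m + rng.normal(0, 1, 100)

# (i) difference in average number of steps between women and men
t_stat, p_diff = stats.ttest_ind(steps_w, steps_m)

# (ii) correlation between steps and BMI for women
r_w, p_w = stats.pearsonr(steps_w, bmi_w)

# (iii) correlation between steps and BMI for men
r_m, p_m = stats.pearsonr(steps_m, bmi_m)

print(f"steps difference p={p_diff:.3f}, r_women={r_w:.2f}, r_men={r_m:.2f}")
```

The point of the experiment, of course, is that a student running exactly this script can answer all three questions correctly without ever once looking at a plot of the raw data.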
When you're given a dataset with no idea of what to look for, it is natural to start with some simple tabulations or data visualisations. And, if students merged the two datasets together and graphed BMI against steps per day, they saw this (from Figure 1a of the paper):
Yes, that is a gorilla waving at you from the data. Interestingly, Yanai and Lercher find that:
...overall, students without a specific hypothesis were almost five times more likely to discover the gorilla when analyzing this dataset...
Specifically, nine of 14 students who were given no hypotheses found the gorilla, but only five of 19 students who were given hypotheses did so. This was a small-scale study, and not entirely serious, but it does illustrate a serious point: when we are focused narrowly on testing specific hypotheses, we may miss important features of the underlying data, and that's why it's a good idea to start with some simple tabulations and visualisations (and I will admit, I'm as guilty as anyone of skipping this step, especially for datasets that I think I know well).
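That "look first" step doesn't need to be elaborate. As a minimal sketch, assuming a merged dataset of steps and BMI (the column names and data here are hypothetical, not the paper's), even a crude character-grid scatter plot makes clusters, outliers, or a waving gorilla hard to miss:

```python
# A quick exploratory look at the data before testing any hypothesis.
# The data are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
steps = rng.uniform(2000, 12000, 300)  # hypothetical daily step counts
bmi = rng.uniform(18, 32, 300)         # hypothetical BMI values

def ascii_scatter(x, y, width=40, height=12):
    """Return a coarse character-grid scatter of (x, y) as a list of strings."""
    grid = [[" "] * width for _ in range(height)]
    cx = ((x - x.min()) / (np.ptp(x) + 1e-9) * (width - 1)).astype(int)
    cy = ((y - y.min()) / (np.ptp(y) + 1e-9) * (height - 1)).astype(int)
    for i, j in zip(cx, cy):
        grid[height - 1 - j][i] = "*"  # flip rows so larger y sits higher
    return ["".join(row) for row in grid]

for row in ascii_scatter(steps, bmi):
    print(row)
```

In practice you'd reach for matplotlib or ggplot, but the principle is the same: one cheap picture of the raw data, before any model or test.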
Yanai and Lercher draw a distinction between 'day science' and 'night science':
There is a hidden cost to having a hypothesis. It arises from the relationship between night science and day science, the two very distinct modes of activity in which scientific ideas are generated and tested, respectively... With a hypothesis in hand, the impressive strengths of day science are unleashed, guiding us in designing tests, estimating parameters, and throwing out the hypothesis if it fails the tests. But when we analyze the results of an experiment, our mental focus on a specific hypothesis can prevent us from exploring other aspects of the data, effectively blinding us to new ideas. A hypothesis then becomes a liability for any night science explorations... Night science has its own liability though, generating many spurious relationships and false hypotheses. Fortunately, these are exposed by the light of day science, emphasizing the complementarity of the two modes, where each overcomes the other’s shortcomings.
So, when we look at data, we need to do so both by day and by night.
[HT: David McKenzie at Development Impact, especially for the clever title of this post]