Saturday 21 August 2021

The past and future of statistical significance

The latest issue of the Journal of Economic Perspectives had a symposium on statistical significance, which included three articles (all ungated). In the first article, Guido Imbens (Stanford University) outlines three concerns with the use of statistical significance and p-values:

The first concern is that often p-values and statistical significance do not answer the question of interest. In many cases, researchers are interested in a point estimate and the degree of uncertainty associated with that point estimate as the precursor to making a decision or recommendation to implement a new policy. In such cases, the absence or presence of statistical significance (in the sense of being able to reject the null hypothesis of zero effect at conventional levels) is not relevant, and the all-too-common singular focus on that indicator is inappropriate...

The second concern arises if a researcher is legitimately interested in assessing a null hypothesis versus an alternative hypothesis... Questions have been raised whether p-values and statistical significance are useful measures for making the comparison between the null and alternative hypotheses...

The third concern is the abuse of p-values... To put it bluntly, researchers are incentivized to find p-values below 0.05.

These concerns are not new, and relate to the case made in the book The Cult of Statistical Significance by Stephen Ziliak and Deirdre McCloskey (which I reviewed here). Imbens argues that the first concern is the most important. Interestingly, he takes a more moderate view than others have in recent years:

In my view, banning p-values is inappropriate. As I have tried to argue in this essay, I think there are many settings where the reporting of point estimates and confidence (or Bayesian) intervals is natural, but there are also other circumstances, perhaps fewer, where the calculation of p-values is in fact the appropriate way to answer the question of interest.

Confidence intervals do make a lot of sense. However, to me they are still not so different from a p-value (the 95% confidence level is just as arbitrary as a significance threshold of 0.05).
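
To see what I mean, here is a minimal sketch in Python (my own illustration, not from the symposium articles, using made-up data and a simple normal approximation). The 95% confidence interval excludes zero exactly when the two-sided p-value falls below 0.05, so the same 5% convention is baked into both summaries:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical data: 100 observations with an assumed true effect of 0.2
x = rng.normal(loc=0.2, scale=1.0, size=100)
estimate = x.mean()
se = x.std(ddof=1) / np.sqrt(len(x))

# Two-sided p-value for H0: effect = 0, using a normal approximation
z = estimate / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

# 95% confidence interval, using the matching critical value
crit = stats.norm.ppf(0.975)
ci_low, ci_high = estimate - crit * se, estimate + crit * se

print(f"estimate = {estimate:.3f}, p = {p_value:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print("CI excludes zero:", ci_low > 0 or ci_high < 0, "| p < 0.05:", p_value < 0.05)
```

The interval does carry more information than the p-value alone, though - it shows the magnitude and precision of the estimate, which is what Imbens' first concern says researchers usually need.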

In the second article, Maximilian Kasy (University of Oxford) discusses the problems arising from the 'garden of forking paths'. The metaphor of forking paths is drawn from Jorge Luis Borges, who:

...wrote a short story in 1941 called “The Garden of Forking Paths.” The plot involves (among other elements) a journey in which the road keeps forking...

Statisticians have used the metaphor from Borges to convey how empirical research also involves a garden of forking paths: how data is chosen and prepared for use, what variables are the focus of inquiry, what statistical methods are used, what results are emphasized in writing up the study, and what decisions are made by journal editors about publication.

Essentially, this article is about the selection bias in published research, arising from reporting bias (where only some, but not all, statistical results are reported in published studies) and publication bias (where only some, but not all, studies are published). Kasy outlines the problems (which, again, are well known to researchers), and then some potential solutions, including: (1) pre-analysis plans, where the analyses are pre-specified and deviations along the forking paths can easily be identified by editors, journal reviewers, and readers of research; (2) pre-results journal review (or 'registered reports'), where journal articles are accepted on the basis of proposed analyses, before any analysis is conducted or results are available; and (3) journals for null results and replication studies, which could reduce publication bias and help to identify studies with fragile results.
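
To make the selection problem concrete, here is a small simulation (my own sketch with made-up parameters, not drawn from Kasy's paper). Many hypothetical studies estimate the same small true effect, but only the statistically significant ones get 'published', and the published estimates overstate the true effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

true_effect = 0.1      # small true effect (assumed for illustration)
n_per_study = 100      # sample size per study
n_studies = 10_000     # number of hypothetical studies

# Each study estimates the effect with sampling noise
se = 1.0 / np.sqrt(n_per_study)
estimates = rng.normal(true_effect, se, size=n_studies)
p_values = 2 * (1 - stats.norm.cdf(np.abs(estimates) / se))

# 'Published' studies: only those with p < 0.05 survive the forking paths
published = estimates[p_values < 0.05]

print(f"True effect:                  {true_effect:.2f}")
print(f"Mean of all estimates:        {estimates.mean():.2f}")
print(f"Mean of 'published' results:  {published.mean():.2f}")
print(f"Share of studies 'published': {len(published) / n_studies:.1%}")
```

With these (invented) numbers, only around one in six studies clears the 0.05 bar, and the average 'published' estimate is more than double the true effect.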

Kasy finishes by making an alternative proposal for the structure of publishing:

There might be a set of top outlets focused on publishing surprising (“relevant”) findings, subject to careful quality vetting by referees. These outlets would have the role of communicating relevant findings to attention-constrained readers (researchers and decision-makers). A key feature of these outlets would be that their results are biased by virtue of being selected based on surprisingness. In fact, this is likely to be true for prominent outlets today, as well. Readers should be aware that this is the case: “Don’t take findings published in top outlets at face value.”

There might then be another wider set of outlets that are not supposed to select on findings but have similar quality vetting as the top outlets, thus focusing on validity and replicability. For experimental studies, pre-analysis plans and registered reports (results-blind review) might serve as institutional safeguards to ensure the absence of selectivity by both researchers and journals. Journals that explicitly invite submission of “null results” might be an important part of this tier of outlets. This wider set of outlets would serve as a repository of available vetted research and would not be subject to the biases induced by the selectivity of top outlets...

To make the findings from this wider set of publications available to attention-constrained decision-makers, systematic efforts at aggregating findings in review articles and meta-studies by independent researchers would be of great value... Lastly, systematic replication studies can serve as a corrective for the biases of top publications and as a further safeguard to check for the presence of selectivity among non-top publications.

I'm not sure how workable that system is, or how we would get there from where we are today. Some of the elements are already in place, and replications and systematic meta-analyses are becoming more common. However, there would be substantial reluctance among top journals to be seen as publishing research that is 'biased by surprisingness'.

The third article, by Edward Miguel (University of California, Berkeley), focuses on open science and research transparency. This covers some of the same ground as Kasy's article (pre-analysis plans and registered reports), but also covers the availability of statistical code and data for replication. Miguel notes that the sharing of data and code has increased over time, but he also notes that it is not without cost:

Across 65 project datasets, the average amount of time to prepare replication materials for public sharing was 31.5 hours, with an interquartile range of 10.0 to 40.5 hours (and a 10th to 90th percentile range of 5.8 to 80.2 hours). This is non-trivial for most projects: still, remember that this estimate of preparation time applies to field experiments that often require multiple years of work on collecting data, so it remains a very small share of overall project work time.

There are offsetting benefits though:

The most immediate private benefit that I and many other scholars have personally experienced from new open data norms is the fact that our own research data is better organized for ourselves and thus easier to reuse for other analyses and papers as a result of the effort that is put into improved documentation (like the README files and other replication materials).

There are a couple of salient issues here, both of which Miguel touches upon. The first issue is equity - producing replication materials is likely to be far less costly for researchers who employ a small army of research assistants than for the many researchers who would have to do this work themselves. The second issue relates to replication more generally, where:

...researchers’ growing ability to access data and code from previous studies has led to some controversy... there may be “overturn bias,” in which reanalysis and replications that contradict an originally published paper are seen as more publishable.

This is related to the 'surprisingness' that Kasy notes. The last thing we would want is for journals devoted to replication to develop a bias towards negative findings (in which case there would be publication biases in both directions).

Overall, there is a lot of interesting material in the three articles in this symposium. It is not all bad news - an optimistic view would be that many of the problems with statistical significance, publication bias, and so on are already acknowledged, and steps are already being taken to address them. The biggest thing that researchers can do going forward, though, is to be a bit more sensible in interpreting statistical significance. The difference between a p-value of 0.049 and a p-value of 0.051 is not in itself statistically significant. As Ziliak and McCloskey noted in their book (a point that, of the three authors in this symposium, only Imbens raises), it is economic significance, rather than statistical significance, that is most important.
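
To put a number on the 0.049-versus-0.051 point, here is one last quick sketch (my own, under the simplifying assumption of two independent estimates with the same standard error). The estimates behind those two p-values are almost identical, and a test of the difference between them is nowhere near significant:

```python
from scipy import stats

# Two hypothetical estimates with the same standard error (se = 1 for simplicity),
# one just under and one just over the conventional 0.05 threshold
se = 1.0
est_a = stats.norm.ppf(1 - 0.049 / 2) * se   # about 1.97, so p = 0.049
est_b = stats.norm.ppf(1 - 0.051 / 2) * se   # about 1.95, so p = 0.051

# Test of the difference between the two (independent) estimates
diff = est_a - est_b
se_diff = (2 ** 0.5) * se
p_diff = 2 * (1 - stats.norm.cdf(abs(diff) / se_diff))

print(f"Estimate A: {est_a:.2f} (p = 0.049)   Estimate B: {est_b:.2f} (p = 0.051)")
print(f"Difference: {diff:.3f}, p-value for the difference: {p_diff:.2f}")
```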
