Saturday, 1 February 2025

Junk papers are poisoning the well for meta-analyses

Regular readers of this blog will know that I am quite fond of meta-analyses. In fact, one of my PhD students will soon publish her first working paper from her thesis, which is a very ambitious (perhaps too ambitious) meta-analysis (more on that in a future post). Anyway, a meta-analysis is where a researcher combines the results of many studies to estimate an overall effect. So, while individual studies may have various biases (and that is one reason why you shouldn't over-emphasise the importance of any one study on its own), a meta-analysis is much less subject to bias.

At least, that was the case until recently. As Holly Else reported in Science last November:

Aquarius, a neuroscientist who specializes in reviewing preclinical animal research, and Wever, a metascientist, both at Radboud University Medical Center, are some of a growing number of systematic review authors who have lost faith in the evidence base they depend on. Their group put its project on hold to quantify the problem. “There is a real danger of systematic reviews losing the power they have,” Aquarius says.

The junk papers are likely the products of paper mills—businesses that produce fake science to order. The size of the problem is not clear, but a manuscript posted to the Center for Open Science’s OSF preprint server in September suggests up to one in seven published papers are fabricated or falsified...

This creates real problems for researchers who are trying to conduct credible meta-analyses. It means taking a much higher level of care over which research papers are included. However, filtering out the junk papers may lead to some 'real' papers being filtered out as well, reducing the accuracy of the resulting meta-analysis.

The problem is only going to get worse. Else's article doesn't even consider the impact that generative AI is having and will continue to have on the publication process. In future, it will likely become more difficult to verify the credibility of research papers. Some will be 'real', some will have real data and analysis but be largely written by generative AI (for example, see here), while others will be the direct product of generative AI with made-up data and analyses.

Else does identify some potential solutions:

In 2023, Cochrane, an international network promoting evidence-based medicine, issued draft guidelines to help these researchers filter out junk science. The REAPPRAISED checklist, an effort by another group of research integrity specialists published in 2020, also helps researchers assess papers’ soundness.

The problem is that the guidelines don't seem well-designed to filter out AI-generated studies that appear to be of high quality. Moreover, having a fixed set of guidelines just invites bad researchers to set up their junk papers to tick the right boxes and pass review.

In my view, we may need to rely on the open science movement in order to thwart the potential flood of junk papers. In particular, making data and statistical code available may become an important signal of paper quality in the future. [*] Authors would only be willing to make their data and code available if their paper is 'real'. If it is not, they would be unwilling to do so, as they run the risk of being found out.

So, while junk papers may be poisoning the well of science right now, we might have an antidote. It is not time to despair, but for the time being researchers conducting meta-analyses will have to take a lot of extra care.

*****

[*] For a signal to be effective, it must meet two conditions: (1) it must be costly; and (2) it must be costly in such a way that those with low quality attributes would not want to attempt the signal. Making data and statistical code available seems to me to meet both of those conditions. It is costly (in terms of time and effort) to clean up and label a dataset and statistical code to be understood or used by others. And if your paper is low-quality, you wouldn't want to expend the additional time and effort necessary to make your data and code available.
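The separating logic in those two conditions can be sketched numerically. The payoff numbers below are entirely made up for illustration (they are not estimates of anything): sharing data and code earns a credibility bonus but carries an effort cost, and a junk paper that shares its data also risks a large exposure penalty.

```python
def payoff(paper_is_real: bool, shares_data: bool) -> float:
    """Hypothetical payoff to an author, relative to not sharing anything."""
    credibility_bonus = 5.0   # readers place more trust in papers with open materials
    sharing_cost = 2.0        # time and effort to clean and document data and code
    exposure_penalty = 20.0   # expected cost of a junk paper being found out

    value = 0.0
    if shares_data:
        value += credibility_bonus - sharing_cost   # condition (1): the signal is costly
        if not paper_is_real:
            value -= exposure_penalty               # condition (2): costlier for low quality
    return value

# A 'real' author gains by sharing:  5 - 2 = 3 > 0
# A junk author loses by sharing:    5 - 2 - 20 = -17 < 0
# So only 'real' papers send the signal, and sharing separates the two types.
```

With these (assumed) numbers, sharing is worthwhile for a 'real' author and ruinous for a junk author, which is exactly what makes the signal credible. If the exposure penalty were small (say, because nobody ever checks the shared data), the signal would stop separating the types.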
