As the Managing Editor of a journal (the Australasian Journal of Regional Studies), I have been watching the artificial intelligence space with interest. One thing that AI could easily be used for is peer review. So far, I haven't seen any evidence that reviewers for my journal have been using AI to complete their reviews, but I know that it is becoming increasingly common (and was something that I did observe as a member of the Marsden Social Sciences panel, before it was disbanded).
What could be so bad about AI completing peer review of research? The truth is that we don't know the answer to that question. As editors and researchers, we may have concerns about whether AI would do a quality job, and whether it would be biased for or against certain types of research and certain types of researchers. But really, there hasn't been much empirical evidence either for or against these concerns.
That's why I was really interested to read two papers that contribute to our understanding of AI in peer review. The first is this 2024 article by Weixin Liang (Stanford University) and co-authors, published in the journal NEJM AI (ungated earlier version here). Their dataset of human feedback was based on over 8,700 reviews of over 3,000 accepted papers from Nature family journals (which had published their peer review reports), and over 6,500 reviews of 1,700 papers from the International Conference on Learning Representations (ICLR), a large machine learning conference (where the authors had access to review reports for both accepted and rejected papers). They quantitatively compared the human feedback with feedback generated by GPT-4.
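The headline metric in that comparison is an overlap (or "hit") rate: the share of points raised by GPT-4 that were also raised by at least one human reviewer. The authors' actual pipeline extracts individual comments and matches them semantically; the sketch below, which uses a crude word-overlap matcher as a stand-in, is only meant to show what the metric itself computes.

```python
def comments_match(a: str, b: str, threshold: float = 0.3) -> bool:
    """Crude stand-in for semantic matching: treat two comments as the
    same point if their word overlap (Jaccard similarity) clears a threshold."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return False
    return len(words_a & words_b) / len(words_a | words_b) >= threshold

def overlap_rate(llm_comments: list[str], human_comments: list[str]) -> float:
    """Share of LLM comments that were raised by at least one human reviewer."""
    if not llm_comments:
        return 0.0
    hits = sum(
        any(comments_match(c, h) for h in human_comments)
        for c in llm_comments
    )
    return hits / len(llm_comments)

# Toy example: one of the two LLM comments matches the human comment
llm = ["the sample size is too small to support the claims",
       "the paper should discuss policy implications"]
human = ["the small sample size is a major limitation"]
print(f"{overlap_rate(llm, human):.0%}")  # 50% in this toy case
```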
Liang et al. found that, for the Nature journal dataset:
More than half (57.55%) of the comments raised by GPT-4 were raised by at least one human reviewer... This suggests a considerable overlap between LLM feedback and human feedback, indicating potential accuracy and usefulness of the system. When comparing LLM feedback with comments from each individual reviewer, approximately one third (30.85%) of GPT-4 raised comments overlapped with comments from an individual reviewer... The degree of overlap between two human reviewers was similar (28.58%), after controlling for the number of comments...
For the ICLR dataset, the results were similar, but the nature of the data allowed for more nuance:
Specifically, papers accepted with oral presentations (representing the top 5% of accepted papers) have an average overlap of 30.63% between LLM feedback and human feedback comments. The average overlap increases to 32.12% for papers accepted with a spotlight presentation (the top 25% of accepted papers), while rejected papers bear the highest average overlap at 47.09%. A similar trend was observed in the overlap between two human reviewers: 23.54% for papers accepted with oral presentations (top 5% accepted papers), 24.52% for papers accepted with spotlight presentations (top 25% accepted papers), and 43.80% for rejected papers.
So, GPT-4's feedback lined up most closely with human reviewers on the weakest papers (those that were rejected), suggesting that it is good at picking up the flaws in papers that deserve rejection, and its overlap in comments with an individual human reviewer was similar to the overlap between two human reviewers. Turning to the types of comments, Liang et al. find that:
LLM comments on the implications of research 7.27 times more frequently than humans do. Conversely, LLM is 10.69 times less likely to comment on novelty than humans are... This variation highlights the potential advantages that a human-AI collaboration could provide. Rather than having LLM fully automate the scientific feedback process, humans can raise important points that LLM may overlook. Similarly, LLM could supplement human feedback by providing more comprehensive comments.
The takeaway message here is that GPT-4 is not really a substitute for a human reviewer, but is a useful complement to human reviewing. Finally, Liang et al. conducted a survey of 308 researchers across 110 US universities, who could upload some research and receive AI feedback. As Liang et al. explain:
Participants were surveyed about the extent to which they found the LLM feedback helpful in improving their work or understanding of a subject. The majority responded positively, with over 50.3% considering the feedback to be helpful, and 7.1% considering it to be very helpful... When compared with human feedback, while 17.5% of participants considered it to be inferior to human feedback, 41.9% considered it to be less helpful than many, but more helpful than some human feedback. Additionally, 20.1% considered it to be about the same level of helpfulness as human feedback, and 20.4% considered it to be even more helpful than human feedback...
In line with the helpfulness of the system, 50.5% of survey participants further expressed their willingness to reuse the system...
And interestingly:
Another participant wrote, “After writing a paper or a review, GPT could help me gain another perspective to re-check the paper.”
I hadn't really considered running my research papers through generative AI to see if it could provide feedback. However, now that I've heard about it, it is completely obvious that I should do so. And so should other researchers. It's a low-cost form of internal feedback. Indeed, Liang et al. conclude that:
...LLM feedback should be primarily used by researchers [to] identify areas of improvements in their manuscripts prior to official submission.
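In practice, getting that kind of pre-submission feedback is straightforward. Here is a minimal sketch, assuming the OpenAI Python SDK and an API key in the environment; it is not the system that Liang et al. built, just an illustration of asking a model for referee-style comments on a draft.

```python
# Minimal pre-submission feedback sketch (not the Liang et al. system).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def llm_feedback(manuscript_text: str, model: str = "gpt-4o") -> str:
    prompt = (
        "You are acting as a journal referee. Read the draft below and "
        "list its main weaknesses across significance, novelty, methods, "
        "and presentation. Be specific and constructive.\n\n"
        + manuscript_text
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example usage, with the draft saved as plain text:
# print(llm_feedback(open("draft.txt").read()))
```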
The second paper is this new working paper by Pat Pataranutaporn (MIT), Nattavudh Powdthavee (Nanyang Technological University), and Pattie Maes (MIT). They undertook an experimental evaluation of AI peer review of economics research articles, to determine how well AI can distinguish the quality of research, and whether it is biased by characteristics of the papers that are unrelated to quality.
To do this, Pataranutaporn et al.:
...randomly selected three papers each from Econometrica, Journal of Political Economy, and Quarterly Journal of Economics (“high-ranked journals” based on RePEc ranking) and three each from European Economic Review, Economica, and Oxford Bulletin of Economics and Statistics (“medium-ranked journals”). Additionally, we randomly selected three papers from each of the three lower-ranked journals not included in the RePEc ranking—Asian Economic and Financial Review, Journal of Applied Economics and Business, and Business and Economics Journal (“low-ranked journals”). To complete the dataset, we included three papers generated by GPT-o1 (“fake AI papers”), designed to match the standards of papers published in top-five economics journals.
They then:
...systematically varied each submission across three key dimensions: authors’ affiliation, prominence, and gender. For affiliation, each submission was attributed to authors affiliated with: i) top-ranked economics departments in the US and UK, including Harvard University, Massachusetts Institute of Technology (MIT), London School of Economics (LSE), and Warwick University, ii) leading universities outside the US and Europe, including Nanyang Technological University (NTU) in Singapore, University of Tokyo in Japan, University of Malaya in Malaysia, Chulalongkorn University in Thailand, and University of Cape Town in South Africa... and iii) no information about the authors’ affiliation, i.e., blind condition.
To introduce variation in academic reputation, we replaced the original authors of the base articles with a new set of authors categorized into the following groups: (i) prominent economists—the top 10 male and female economists from the RePEc top 25% list; (ii) lower-ranked economists—individuals ranked near the bottom of the RePEc top 25% list; (iii) non-academic individuals—randomly generated names with no professional affiliation; and (iv) anonymous authorship—papers where author names were omitted. For non-anonymous authorship, we further varied each submission by gender, ensuring an equal split (50% male, 50% female). Combining these variations resulted in 9,030 unique papers, each with distinct author characteristics...
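To get a feel for how the design multiplies out, here is an illustrative sketch of crossing the manipulated dimensions. The category labels follow the paper, but the number of variants per category (and hence the exact 9,030 total) depends on details of the author lists that the sketch does not reproduce.

```python
# Illustrative sketch of the factorial design: base papers crossed with
# affiliation, author prominence, and (for named authors) gender.
from itertools import product

base_papers = [f"paper_{i:02d}" for i in range(1, 31)]  # 30 base articles
affiliations = ["top_us_uk", "leading_non_us_eu", "blind"]
author_types = ["prominent", "lower_ranked", "non_academic", "anonymous"]
genders = ["male", "female"]

variants = []
for paper, affiliation, author_type in product(base_papers, affiliations, author_types):
    if author_type == "anonymous":
        variants.append((paper, affiliation, author_type, None))  # no gender cue
    else:
        for gender in genders:
            variants.append((paper, affiliation, author_type, gender))

print(len(variants))  # an illustrative count, not the paper's 9,030
```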
Pataranutaporn et al. then asked GPT-4o-mini to evaluate each of the 9,030 unique papers across a number of dimensions, including whether it would be accepted or rejected at a top-five journal, the reviewer recommendation, the predicted number of citations, and whether the paper would attract research funding, result in a research award, strengthen an application for tenure, and be part of a research agenda worthy of a Nobel Prize in economics.
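A hedged sketch of what that kind of structured evaluation might look like is below, again assuming the OpenAI Python SDK; the rubric fields mirror the dimensions just listed, but the prompt wording and model settings are my own illustration rather than the authors' protocol.

```python
# Illustrative only: ask GPT-4o-mini for a structured, rubric-style
# evaluation of a paper. The field names are assumptions for this sketch.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

RUBRIC = (
    "Evaluate the paper below and answer in JSON with these fields: "
    "accept_at_top_five (yes/no), reviewer_recommendation "
    "(reject/major revision/minor revision/accept), "
    "predicted_citations (integer), would_attract_funding (yes/no), "
    "likely_research_award (yes/no), strengthens_tenure_case (yes/no), "
    "nobel_worthy_agenda (yes/no)."
)

def evaluate_paper(paper_text: str, model: str = "gpt-4o-mini") -> dict:
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": RUBRIC + "\n\n" + paper_text}],
    )
    return json.loads(response.choices[0].message.content)
```

They found that: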
...LLM is highly effective at distinguishing between submissions published in low-, medium-, and high-quality journals. This result highlights the LLM’s potential to reduce editorial workload and expedite the initial screening process significantly. However, it struggles to differentiate high-quality papers from AI-generated submissions crafted to resemble “top five” journal standards. We also find compelling evidence of a modest but consistent premium—approximately 2–3%—associated with papers authored by prominent individuals, male economists, or those affiliated with elite institutions compared to blind submissions. While these effects might seem small, they may still influence marginal publication decisions, especially when journals face binding constraints on publication slots.
So, on the one hand, the AI tool does a good job of distinguishing submissions by the quality of the journal they were published in. However, it can't tell genuine top-journal papers apart from AI-generated submissions crafted to look like them. And it shows a small but consistent premium for prominent authors, male economists, and authors at elite institutions, relative to blind submissions. Both of these latter points are worrying, but again they suggest that a combination of human and AI reviewers might be a suitable path forward.
Pataranutaporn et al.'s paper is focused on solving a "peer review crisis". It has become increasingly difficult to find peer reviewers who are willing to spend the time to generate a high-quality review that will in turn help to improve the quality of published research. Generative AI could help to alleviate this, but we're clearly not entirely there yet. There is still an important role for humans in the peer review process, at least for now.
[HT: Marginal Revolution, for the Pataranutaporn et al. paper]