Monday, 3 July 2023

Have large language models killed online data collection?

Data is the lifeblood of empirical social science research. Whether it be quantitative or qualitative data, or both, you couldn't do empirical research without it. Self-evidently, the quality of data matters. As the saying goes, garbage in, garbage out. You want high-quality data to analyse. So, this new working paper by Veniamin Veselovsky, Manoel Horta Ribeiro, and Robert West (all École Polytechnique Fédérale de Lausanne) should be causing some disquiet, especially among those who use Amazon mTurk and similar sources for generating data, because in the paper the authors:

...quantify the usage of LLMs by crowd workers through a case study on MTurk, based on a novel methodology for detecting synthetic text. In particular, we consider part of the text summarization task from Horta Ribeiro et al. (2019), where crowd workers summarized 16 medical research paper abstracts. By combining keystroke detection and synthetic text classification, we estimate that 33-46% of the summaries submitted by crowd workers were produced with the help of LLMs.

Yikes! Between one-third and almost one-half of the summaries submitted by mTurk workers were produced with the help of large language models (LLMs) like ChatGPT. It is easy to see that using mTurk for collecting data from experiments, surveys, etc. has just become untenable. At least, it is untenable if researchers want data collected from real humans, rather than from LLMs masquerading as humans.
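To give a flavour of how detection like this might work, here is a minimal sketch of one of the signals the paper combines: keystroke-based paste detection. The idea is that a summary typed out by a human generates roughly one keystroke per character, whereas a summary pasted in from ChatGPT generates very few. The function name, threshold, and inputs below are my own illustrative assumptions, not the authors' actual implementation:

```python
# Hypothetical sketch of a keystroke-based paste heuristic (illustrative only;
# not the Veselovsky et al. implementation). A submission whose recorded
# keystrokes cover only a small fraction of the final text's characters was
# probably pasted in rather than typed.

def likely_pasted(keystroke_count: int, text: str,
                  ratio_threshold: float = 0.5) -> bool:
    """Flag a submission if keystrokes cover less than `ratio_threshold`
    of the characters in the final text. The 0.5 cutoff is an assumption
    chosen for illustration."""
    if not text:
        return False
    return keystroke_count / len(text) < ratio_threshold

# A typed summary: roughly one keystroke per character, so not flagged.
print(likely_pasted(keystroke_count=480, text="x" * 500))  # False
# A pasted summary: almost no keystrokes for a long text, so flagged.
print(likely_pasted(keystroke_count=20, text="x" * 500))   # True
```

On its own this heuristic only detects pasting, not LLM use specifically (a worker might paste text they typed elsewhere), which is presumably why the paper combines it with a synthetic-text classifier before arriving at the 33-46% estimate.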

It gets worse, though. It isn't just mTurk where this is likely to be a problem. Any online survey is now vulnerable to being completed by an LLM, rendering most online data collection fraught. Journal editors and reviewers will no doubt become aware of this (if they aren't already), so research based on data collected from humans using online methods is going to become a whole lot harder to get published in future.

It's not going to end there. Since LLMs are now generating a non-trivial proportion of online content, a lot of online data is going to lose credibility. And, to top it all off, if future LLMs are trained on internet-sourced data, they will effectively be trained, in part, on data generated by today's relatively low-quality LLMs. There doesn't seem to be much of a way around this.

Anyway, getting back to the Veselovsky et al. article, they aren't as negative in their conclusions as I am above:

All this being said, we do not believe that this will signify the end of crowd work, but it may lead to a radical shift in the value provided by crowd workers.

I guess it depends on what you want the crowd workers to do. As I said above, they won't be contributing much of value to researchers in the future (unless the researchers are researching LLMs). Part of the lifeblood of social science research is bleeding away.

[HT: Marginal Revolution]
