Back in 2023, I wrote about the impact that ChatGPT would have on online dating. I now think I seriously undersold the significance of large language models talking on our behalf to other large language models. The broader point is illustrated well in this new working paper by Jiannan Xu (University of Maryland), Gujie Li (National University of Singapore), and Jane Yi Jiang (Ohio State University). They look at the idea of 'AI self-preference' and its impact on hiring practices, defining AI self-preference as:
...the inclination of a model to favor content it generated itself over that written by humans or produced by alternative models...
So, if ChatGPT prefers resumes written by ChatGPT over those written by humans, that would be AI self-preference. Given the context of hiring, Xu et al.:
...examine whether LLMs, when deployed as evaluators, systematically favor resumes they generated themselves over otherwise equivalent resumes written by humans or produced by alternative models. To test this, we construct a large-scale resume correspondence experiment using a real-world dataset of 2,245 human-written resumes, sourced from a professional resume-building platform prior to the widespread adoption of generative AI. For each resume, we generate multiple counterfactual versions using a range of state-of-the-art LLMs, including GPT-4o, GPT-4o-mini, GPT-4-turbo, LLaMA 3.3-70B, Mistral-7B, Qwen 2.5-72B, and Deepseek-V3. Having content quality controlled, we assess whether these LLMs exhibit systematic bias in favor of their own outputs when acting as evaluators.
There is a lot of depth in the paper, and I encourage you to read it. However, I just want to focus on their headline results, which come from getting each model to choose between a resume where it wrote the executive summary itself and a resume where the executive summary was written by a human (or, in other comparisons, by another AI model). The executive summary was the only part of each resume that was AI-generated. To be clear, Xu et al. didn't get each model to compare the exact same resume with different executive summaries, but two different resumes, one with an AI-generated summary and the other written by a human. These comparisons allow Xu et al. to determine whether each AI model prefers its own output over others.
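To make the setup concrete, here's a minimal sketch of what one such pairwise comparison might look like. The prompt wording, the query_model() helper, and the response parsing are illustrative assumptions of mine, not Xu et al.'s actual code:

```python
import random

# Minimal sketch of one pairwise comparison, in the spirit of Xu et
# al.'s design. The prompt wording, the query_model() helper, and the
# response parsing are illustrative assumptions, not the paper's code.

def evaluate_pair(query_model, resume_ai: str, resume_human: str) -> str:
    """Ask an LLM evaluator to choose between two resumes, randomising
    which appears first to guard against position effects."""
    ai_first = random.random() < 0.5
    first, second = ((resume_ai, resume_human) if ai_first
                     else (resume_human, resume_ai))
    prompt = (
        "You are screening candidates for a role. Below are two resumes. "
        "Reply with only 'A' or 'B' for the candidate you would shortlist.\n\n"
        f"Resume A:\n{first}\n\nResume B:\n{second}"
    )
    picked_a = query_model(prompt).strip().upper().startswith("A")
    return "ai" if picked_a == ai_first else "human"

# The self-selection rate is then just the share of comparisons in
# which the evaluator picked the resume whose summary it wrote itself.
def self_selection_rate(choices: list[str]) -> float:
    return choices.count("ai") / len(choices)
```

And the results are striking: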
...most LLMs exhibit strong self-preferencing behavior. Notably, larger or more aligned models—such as GPT-4-turbo, GPT-4o, GPT-4o-mini, DeepSeek-V3, Qwen 2.5-72B, and LLaMA-3.3-70B—demonstrate an overwhelming preference for their own outputs, with self-selection rates exceeding 96%. These high rates translate into substantial statistical parity self-preference biases exceeding 92%. In contrast, smaller or less aligned models—such as Mistral-7B, LLaMA-3.2-3B, and LLaMA 3.2-1B—display substantially lower self-preferencing bias.
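As an aside on the metric: a 96% self-selection rate lining up with a 92% statistical parity bias suggests the bias is the gap between the probability of a model selecting its own output and the probability of selecting the alternative, i.e. 2s - 1 for a self-selection rate s. Assuming that reading (mine, not spelled out in the excerpt), the arithmetic checks out:

```python
# Quick consistency check, assuming 'statistical parity bias' means
# P(select own output) - P(select the alternative) - my reading of
# the metric, not a definition taken from the paper.
s = 0.96            # self-selection rate quoted for the larger models
bias = s - (1 - s)  # = 2s - 1
print(round(bias, 2))  # 0.92, matching the "exceeding 92%" figure above
```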
Only the smallest model, LLaMA 3.2-1B, showed a preference for human-generated resumes. All other models preferred their own output. When given the choice between a resume where they wrote the executive summary and one where a human wrote it, the models selected the resume with the AI-generated summary over 90 percent of the time. Xu et al. go on to show that this descriptive comparison continues to hold even after controlling for the quality of the resume, as well as linguistic quality and textual similarity. In those comparisons:
Larger systems—such as GPT-4o, GPT-4-turbo, DeepSeek-V3, Qwen-2.5-72B, and LLaMA 3.3-70B—exhibit particularly strong bias, exceeding 68% even after controlling for content quality and reaching over 80% for GPT-4o, Qwen-2.5-72B, and LLaMA 3.3-70B.
So, the self-preference compared to humans isn't because the models write higher-quality summaries. They really do just prefer things that they wrote themselves. Which shouldn't be surprising - like any human, they write in a style that they prefer, and so, when evaluating, they also choose that style. However, turning to models' preference for their own writing compared with the writing of other AI models, the results are more mixed. Some models show a self-preference, while others do not.
Does any of this matter? Xu et al. show that their results have practical significance by running a simulation, which shows that:
...candidates using the same LLM as the evaluator are about 15–68% more likely to be shortlisted than equally qualified applicants submitting human-written resumes. The disadvantage is most severe in business-related fields such as accounting, sales, and finance, and less pronounced in areas like agriculture, arts, and automotive.
Finally, Xu et al. show that the self-preference can be mitigated using two strategies:
The first strategy uses system prompting to explicitly instruct models to ignore the origin of resumes and focus only on substantive content. The second strategy employs a majority voting ensemble, combining the evaluator model with smaller models that exhibit weaker self-recognition, thereby diluting the bias of any single LLM. Across all tested LLMs, these interventions reduce LLM-vs-Human self-preference by more than 60%...
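To give a sense of how these might be wired up, here is a minimal majority-voting sketch. The system prompt wording (covering the first strategy), the model names, and the query() helper are placeholders of mine, not the paper's implementation:

```python
from collections import Counter

# Rough sketch of the two mitigations described above. The system
# prompt text, model names, and query(model, system, prompt) helper
# are placeholder assumptions of mine, not Xu et al.'s implementation.

SYSTEM_PROMPT = (
    "Evaluate resumes strictly on their substantive content. Ignore any "
    "cues about whether the text was written by a human or by an AI model."
)

def ensemble_evaluate(query, prompt: str,
                      evaluator: str = "gpt-4o",
                      helpers: tuple = ("mistral-7b", "llama-3.2-3b")) -> str:
    """Majority vote over the main evaluator plus smaller models with
    weaker self-recognition, diluting any single model's bias."""
    votes = [query(model, SYSTEM_PROMPT, prompt)
             for model in (evaluator, *helpers)]
    return Counter(votes).most_common(1)[0][0]
```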
On these mitigation strategies, it is important to note that only the evaluator is in a position to apply them, not the person whose resume is being evaluated. Xu et al. are silent on what the evaluated person should do to avoid being disadvantaged by AI self-preference.
However, despite Xu et al.'s silence on this, the implications of their results are pretty clear for job applicants. Get an AI model to write your resume executive summary! If you write the summary yourself, then your resume will be disadvantaged relative to candidates who used generative AI. If you use generative AI, but a different model from the one doing the evaluating, your resume will still be advantaged relative to human-written resumes. But if you happen to use the same model as the one doing the evaluating, then you maximise your advantage. That suggests a good strategy may be to try to find out which generative AI the evaluators will be using. Do they use Google Workspace? Get Gemini to write your resume. Do they use Office 365? ChatGPT might be a better option.
This extends much further than job applications and resumes. Any writing that is likely to be evaluated by generative AI will be advantaged if it is also written by generative AI. Is your promotion application going to be vetted by generative AI? Get generative AI to write your promotion application for you. Is your award application going to be shortlisted using generative AI? Get generative AI to write your award application for you. Is your research paper going to be reviewed by generative AI? Get generative AI to write your research paper for you. Is your essay or dissertation going to be graded by generative AI? Get generative AI to write your essay or dissertation for you. [*]
And that brings us full circle. If your dating profile is going to be evaluated by your potential match's generative AI, get generative AI to write your dating profile. And if your responses to conversations in the dating app are being evaluated by generative AI? You guessed it. Generative AI should be writing your dating app conversations for you.
[HT: Marginal Revolution]
*****
[*] Note for my research students: Your dissertation will not be graded by generative AI. So, getting generative AI to write your dissertation for you is not a winning strategy.