Large Language Models Are Not Fair Evaluators

Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, Zhifang Sui

Venue: Annual Meeting of the Association for Computational Linguistics (ACL) 2024
Recognition: Most Influential ACL 2024 Paper (Rank No. 3)
Edition: 2026-03
Impact factor: 7
Certificate ID: dd6f7f474ef0c4a8

Abstract

In this paper, we uncover a positional bias in the evaluation paradigm of adopting large language models (LLMs), e.g., GPT-4, as a referee to score and compare the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 of 80 tested queries with ChatGPT as an evaluator. We propose a simple yet effective calibration framework to address our discovered positional bias. To evaluate the effectiveness of our framework, we manually annotate the "win/tie/lose" outcomes of responses from ChatGPT and Vicuna-13B on the Vicuna Benchmark's question prompts. Extensive experiments demonstrate that our approach successfully alleviates evaluation bias, resulting in closer alignment with human judgments.
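The abstract does not spell out the calibration framework, so the following is only a minimal sketch of one plausible way to counter the order effect it describes: query the evaluator with both response orders and average the scores each response receives. Every name here (`Judge`, `swapped_order_scores`, `verdict`, `biased_toy_judge`) is hypothetical, and the toy judge stands in for an LLM call; this should not be read as the authors' actual method.

```python
from typing import Callable, Tuple

# Hypothetical judge signature (an assumption, not from the paper):
# (question, first_response, second_response) -> (score_first, score_second)
Judge = Callable[[str, str, str], Tuple[float, float]]


def swapped_order_scores(
    judge: Judge, question: str, resp_a: str, resp_b: str
) -> Tuple[float, float]:
    """Score A and B in both presentation orders and average per response."""
    a_first, b_second = judge(question, resp_a, resp_b)  # A shown first
    b_first, a_second = judge(question, resp_b, resp_a)  # B shown first
    return (a_first + a_second) / 2, (b_second + b_first) / 2


def verdict(score_a: float, score_b: float) -> str:
    """Return the win/tie/lose outcome from response A's point of view."""
    if score_a > score_b:
        return "win"
    if score_a < score_b:
        return "lose"
    return "tie"


def biased_toy_judge(question: str, first: str, second: str) -> Tuple[float, float]:
    """Toy stand-in for an LLM judge: scores by length, plus a bonus for
    whichever response appears first (a simulated positional bias)."""
    return len(first) + 2.0, float(len(second))


if __name__ == "__main__":
    q, a, b = "q", "abc", "abcd"
    # Single-order evaluation: the position bonus decides the outcome.
    print(verdict(*biased_toy_judge(q, a, b)))
    # Averaging over both orders cancels the position bonus.
    print(verdict(*swapped_order_scores(biased_toy_judge, q, a, b)))
```

In this toy setup the single-order verdict flips depending on which response is shown first, while the averaged scores depend only on the responses themselves.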
