
Measuring what Matters: Construct Validity in Large Language Model Benchmarks

NeurIPS 2025 Datasets and Benchmarks Track


LLM benchmarks are essential for tracking progress and ensuring safety in AI, but most benchmarks don't measure what matters.

We reviewed 445 LLM benchmarks from the proceedings of top AI conferences. We found widespread measurement challenges, including vague definitions of the target phenomenon and the absence of statistical tests. We frame these as challenges to the construct validity of LLM benchmarks: many benchmarks are not valid measurements of their intended targets.

We built a taxonomy of these failures and translated it into an operational checklist to help future benchmark authors demonstrate construct validity.

Does your benchmark measure up? Compare against our Construct Validity Checklist.

Methods

We conducted a systematic review, as illustrated below. We began with 46,114 articles drawn from the proceedings of ICML, ICLR, and NeurIPS from 2018 to 2024 (accessed via the conference proceedings websites), and of ACL, NAACL, and EMNLP from 2020 to 2024 (accessed via the ACL Anthology).

We identified and selected articles whose titles or abstracts contained the keyword 'benchmark' and either 'LLM' or 'language model.' We then conducted a series of manual and automated filtering steps to select the 445 articles included in the final review.

Systematic review process. (A) Identification and screening from relevant proceedings. (B) In-depth review and annotation of included benchmarks. A phenomenon is operationalised as a task and scored with a metric to support a claim about that phenomenon. (C) Synthesis of best practices.
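To make the keyword screening step concrete, here is a minimal Python sketch. It is illustrative only, not the review pipeline itself; the record format (dicts with 'title' and 'abstract' fields) is an assumption for illustration.

```python
# Minimal sketch of the keyword screening step (illustrative only, not the
# actual review pipeline). Assumes each article is a dict with 'title' and
# 'abstract' fields.
import re

def matches_keywords(article: dict) -> bool:
    """Keep articles whose title or abstract mentions 'benchmark' and
    either 'LLM' or 'language model'."""
    text = f"{article.get('title', '')} {article.get('abstract', '')}".lower()
    has_benchmark = "benchmark" in text
    has_lm = re.search(r"\bllm\b", text) is not None or "language model" in text
    return has_benchmark and has_lm

def screen(articles: list[dict]) -> list[dict]:
    """Keyword filter applied before the manual and automated review steps."""
    return [a for a in articles if matches_keywords(a)]

# Toy example:
candidates = [
    {"title": "A Benchmark for LLM Reasoning", "abstract": "We evaluate ..."},
    {"title": "Faster Optimizers", "abstract": "No benchmark keywords here."},
]
print([a["title"] for a in screen(candidates)])  # ['A Benchmark for LLM Reasoning']
```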

Results

We reviewed each benchmark article with a twenty-one-item questionnaire. Twenty-nine experts in NLP and ML contributed to this effort.

Summary of reviewed articles. (A) Three most common categories of benchmark phenomena, grouped into general capabilities, general applications, and specific applications. (B) Number of articles by publication year and number which discuss the construct validity of their benchmark.

Key codebook results. The distribution of codebook responses on selected items. In each column, the options are ordered from most to least preferred for high construct validity. The shaded area indicates the benchmarks that follow the best practices for all five items.

Construct validity checklist

Informed by our systematic review, we provide eight recommendations to ensure the construct validity of your benchmark:

  • Define the phenomenon
  • Measure only the phenomenon
  • Construct a representative dataset for the task
  • Acknowledge limitations of reusing datasets
  • Prepare for contamination
  • Use statistical methods to compare models (see the sketch after this list)
  • Conduct an error analysis
  • Justify construct validity
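
The checklist does not prescribe a specific statistical test; as one common option, the sketch below uses a paired bootstrap over benchmark items to check whether one model's accuracy reliably exceeds another's. The per-item 0/1 scores, resample count, and toy data are assumptions for illustration.

```python
# Minimal sketch of a paired bootstrap test for comparing two models on the
# same benchmark items (one common choice, not a prescribed method).
# Per-item scores are assumed to be 0/1 correctness values.
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Return the observed accuracy gap (A - B) and the fraction of bootstrap
    resamples in which model B matches or beats model A."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    n = len(scores_a)
    idx = rng.integers(0, n, size=(n_resamples, n))  # resample items with replacement
    diffs = scores_a[idx].mean(axis=1) - scores_b[idx].mean(axis=1)
    observed = scores_a.mean() - scores_b.mean()
    p_like = (diffs <= 0).mean()  # how often the gap disappears under resampling
    return observed, p_like

# Toy example: 200 items, model A slightly ahead of model B.
rng = np.random.default_rng(1)
a = rng.random(200) < 0.72
b = rng.random(200) < 0.68
gap, p = paired_bootstrap(a, b)
print(f"accuracy gap = {gap:.3f}, P(gap <= 0 under resampling) = {p:.3f}")
```

A small reported gap with a large resampling probability suggests the ranking between the two models is not supported by the benchmark's item set.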

See our interactive checklist, including PDF and LaTeX versions, on this page.

Andrew M. Bean1, Ryan Othniel Kearns1, Angelika Romanou2, Franziska Sofia Hafner1, Harry Mayne1, Jan Batzner3,4, Negar Foroutan2, Chris Schmitz5, Karolina Korgul1, Hunar Batra1, Oishi Deb1, Emma Beharry6, Cornelius Emde1, Thom Foster1, Anna Gausen7, María Grandury8,9, Simeng Han10, Valentin Hofmann11,12, Lujain Ibrahim1, Hazel Kim1, Hannah Rose Kirk1,7, Fangru Lin1, Gabrielle Kaili-May Liu10, Lennart Luettgau7, Jabez Magomere1, Jonathan Rystrøm1, Anna Sotnikova2, Yushi Yang1, Yilun Zhao10, Adel Bibi1, Antoine Bosselut2, Ronald Clark1, Arman Cohan10, Jakob Foerster1, Yarin Gal1,7, Scott A. Hale1,13, Inioluwa Deborah Raji14, Chris Summerfield1,7, Philip H.S. Torr1, Cozmin Ududec7, Luc Rocher1, Adam Mahdi1
1 University of Oxford  ·  2 EPFL  ·  3 Weizenbaum Institute Berlin  ·  4 Technical University Munich  ·  5 Centre for Digital Governance, Hertie School  ·  6 Stanford University  ·  7 UK AI Security Institute  ·  8 SomosNLP  ·  9 Universidad Politécnica de Madrid  ·  10 Yale University  ·  11 Allen Institute for AI  ·  12 University of Washington  ·  13 Meedan  ·  14 UC Berkeley

Citation