
Measuring what Matters: Construct Validity in Large Language Model Benchmarks

NeurIPS 2025 Datasets and Benchmarks Track


LLM benchmarks are essential for tracking progress and ensuring safety in AI, but most benchmarks don't measure what matters.

We reviewed 445 LLM benchmarks from the proceedings of top AI conferences. We found widespread measurement challenges, including vague definitions of the target phenomenon and the absence of statistical tests. We frame these as challenges to the construct validity of LLM benchmarks: many benchmarks are not valid measurements of their intended targets.

We built a taxonomy of these failures and translated it into an operational checklist to help future benchmark authors demonstrate construct validity.

Does your benchmark measure up? Compare against our Construct Validity Checklist.

Methods

We conducted a systematic review, as illustrated below. We began with 46,114 articles drawn from the proceedings of ICML, ICLR, and NeurIPS from 2018 to 2024 (accessed via the conference proceedings websites), and of ACL, NAACL, and EMNLP from 2020 to 2024 (accessed via the ACL Anthology).

We identified and selected articles whose titles or abstracts contained the keyword 'benchmark' and either 'LLM' or 'language model.' We then conducted a series of manual and automated filtering steps to select the 445 articles included in the final review.

Systematic review process. (A) Identification and screening from relevant proceedings. (B) In-depth review and annotation of included benchmarks. A phenomenon is operationalised as a task and scored with a metric to support a claim about that phenomenon. (C) Synthesis of best practices.
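To make the keyword screening step concrete, here is a minimal Python sketch. It is illustrative only, not the review pipeline itself; the record format (dicts with 'title' and 'abstract' fields) is an assumption for illustration.

```python
# Minimal sketch of the keyword screening step (illustrative only, not the
# actual review pipeline). Assumes each article is a dict with 'title' and
# 'abstract' fields.
import re

def matches_keywords(article: dict) -> bool:
    """Keep articles whose title or abstract mentions 'benchmark' and
    either 'LLM' or 'language model'."""
    text = f"{article.get('title', '')} {article.get('abstract', '')}".lower()
    has_benchmark = "benchmark" in text
    has_lm = re.search(r"\bllm\b", text) is not None or "language model" in text
    return has_benchmark and has_lm

def screen(articles: list[dict]) -> list[dict]:
    """Keyword filter applied before the manual and automated review steps."""
    return [a for a in articles if matches_keywords(a)]

# Toy example:
candidates = [
    {"title": "A Benchmark for LLM Reasoning", "abstract": "We evaluate ..."},
    {"title": "Faster Optimizers", "abstract": "No benchmark keywords here."},
]
print([a["title"] for a in screen(candidates)])  # ['A Benchmark for LLM Reasoning']
```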

Results

We reviewed each benchmark article with a twenty-one-item questionnaire. Twenty-nine experts in NLP and ML contributed to this effort.

Summary of reviewed articles. (A) Three most common categories of benchmark phenomena, grouped into general capabilities, general applications, and specific applications. (B) Number of articles by publication year and number which discuss the construct validity of their benchmark.

Key codebook results. The distribution of codebook responses on selected items. In each column, the options are ordered from most to least preferred for high construct validity. The shaded area indicates the benchmarks that follow the best practices for all five items.

Construct validity checklist

Informed by our systematic review, we provide eight recommendations to ensure the construct validity of your benchmark:

  • Define the phenomenon
  • Measure only the phenomenon
  • Construct a representative dataset for the task
  • Acknowledge limitations of reusing datasets
  • Prepare for contamination
  • Use statistical methods to compare models (see the sketch after this list)
  • Conduct an error analysis
  • Justify construct validity
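
The checklist does not prescribe a specific statistical test; as one common option, the sketch below uses a paired bootstrap over benchmark items to check whether one model's accuracy reliably exceeds another's. The per-item 0/1 scores, resample count, and toy data are assumptions for illustration.

```python
# Minimal sketch of a paired bootstrap test for comparing two models on the
# same benchmark items (one common choice, not a prescribed method).
# Per-item scores are assumed to be 0/1 correctness values.
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Return the observed accuracy gap (A - B) and the fraction of bootstrap
    resamples in which model B matches or beats model A."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    n = len(scores_a)
    idx = rng.integers(0, n, size=(n_resamples, n))  # resample items with replacement
    diffs = scores_a[idx].mean(axis=1) - scores_b[idx].mean(axis=1)
    observed = scores_a.mean() - scores_b.mean()
    p_like = (diffs <= 0).mean()  # how often the gap disappears under resampling
    return observed, p_like

# Toy example: 200 items, model A slightly ahead of model B.
rng = np.random.default_rng(1)
a = rng.random(200) < 0.72
b = rng.random(200) < 0.68
gap, p = paired_bootstrap(a, b)
print(f"accuracy gap = {gap:.3f}, P(gap <= 0 under resampling) = {p:.3f}")
```

A small reported gap with a large resampling probability suggests the ranking between the two models is not supported by the benchmark's item set.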

See our interactive checklist, including PDF and LaTeX versions, on this page.

Andrew M. Bean1, Ryan Othniel Kearns1, Angelika Romanou2, Franziska Sofia Hafner1, Harry Mayne1, Jan Batzner3,4, Negar Foroutan2, Chris Schmitz5, Karolina Korgul1, Hunar Batra1, Oishi Deb1, Emma Beharry6, Cornelius Emde1, Thom Foster1, Anna Gausen7, María Grandury8,9, Simeng Han10, Valentin Hofmann11,12, Lujain Ibrahim1, Hazel Kim1, Hannah Rose Kirk1,7, Fangru Lin1, Gabrielle Kaili-May Liu10, Lennart Luettgau7, Jabez Magomere1, Jonathan Rystrøm1, Anna Sotnikova2, Yushi Yang1, Yilun Zhao10, Adel Bibi1, Antoine Bosselut2, Ronald Clark1, Arman Cohan10, Jakob Foerster1, Yarin Gal1,7, Scott A. Hale1,13, Inioluwa Deborah Raji14, Chris Summerfield1,7, Philip H.S. Torr1, Cozmin Ududec7, Luc Rocher1, Adam Mahdi1
1 University of Oxford  ·  2 EPFL  ·  3 Weizenbaum Institute Berlin  ·  4 Technical University Munich  ·  5 Centre for Digital Governance, Hertie School  ·  6 Stanford University  ·  7 UK AI Security Institute  ·  8 SomosNLP  ·  9 Universidad Politécnica de Madrid  ·  10 Yale University  ·  11 Allen Institute for AI  ·  12 University of Washington  ·  13 Meedan  ·  14 UC Berkeley

Citation