March 17, 2025

Where's Your Pre-registration: A Physicist's Notes from the Cheap Seats on AI's Benchmarking Crisis
The post critiques AI evaluation methods from a physicist's perspective, arguing that the field lacks the scientific rigor that is routine in experimental physics. Physicists define success criteria before running experiments (CERN, for example, required five-sigma statistical significance before announcing the Higgs boson), whereas AI benchmarking suffers from three critical problems:
Benchmarks are abandoned as soon as models perform well on them, creating an endless replacement cycle that never measures meaningful progress.
Because models train on vast swaths of internet data, benchmarks are likely contaminated: evaluation becomes an open-book exam for models that have already seen the material (a minimal contamination check is sketched after this list).
Current methods fail to measure generalization properly: whether models truly understand concepts or merely memorize patterns.
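To make the contamination problem concrete, here is a minimal sketch of the kind of check one could run, assuming access to a sample of the training corpus. It is not from the post itself, and all names and the n-gram threshold are hypothetical; real contamination audits are considerably more involved.

```python
"""Hypothetical n-gram overlap check for benchmark contamination."""


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items: list[str],
                       corpus_sample: list[str],
                       n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus sample."""
    corpus_ngrams: set[tuple[str, ...]] = set()
    for doc in corpus_sample:
        corpus_ngrams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_ngrams)
    return flagged / len(benchmark_items) if benchmark_items else 0.0


if __name__ == "__main__":
    # Toy example: one of two benchmark questions appears verbatim in the corpus sample.
    corpus = ["the higgs boson was confirmed at a five sigma significance level in 2012"]
    items = [
        "the higgs boson was confirmed at a five sigma significance level in 2012",
        "derive the decay width of the z boson at tree level",
    ]
    print(f"contamination rate: {contamination_rate(items, corpus, n=6):.2f}")  # 0.50
```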
The author proposes a "Standard Model of AI Evaluation": a framework that brings together cognitive scientists, AI researchers, philosophers, and evaluation experts to build hypothesis-driven benchmarks rather than difficulty-driven ones. It would require pre-registered hypotheses, contamination-prevention strategies, and success criteria defined before any model is evaluated.
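As one way to picture what such a pre-registration record might contain, here is a hypothetical sketch. The field names, thresholds, and example values are illustrative assumptions, not details taken from the post.

```python
"""Hypothetical sketch of a pre-registered evaluation specification."""
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class PreregisteredEval:
    hypothesis: str              # falsifiable claim, stated before any model is run
    success_criterion: str       # statistical threshold fixed in advance
    contamination_strategy: str  # how training-data leakage is prevented or detected
    registered_on: date
    analysis_plan: list[str] = field(default_factory=list)


spec = PreregisteredEval(
    hypothesis="The model generalizes arithmetic beyond digit lengths seen in training.",
    success_criterion=">= 90% accuracy on held-out 12-digit addition, p < 0.01 vs. chance",
    contamination_strategy="Problems generated after the training cutoff; n-gram overlap audit",
    registered_on=date(2025, 3, 17),
    analysis_plan=["Fix prompts before evaluation", "Report all runs, not just the best"],
)

print(spec.hypothesis)
```

The point of freezing such a record before evaluation is the same as in physics: the success criterion cannot be quietly adjusted after the results are in.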
The post concludes by asking whether systems that may transform society deserve evaluation standards at least as rigorous as those applied to the discovery of a new particle.