State-of-the-art artificial intelligence (AI) models from Alibaba, Google, Meta, Microsoft, Mistral AI, and OpenAI have come under recent scrutiny for allegedly “cheating” on AI benchmarking tests, writes Tristan Greene.

Evidence presented by whistleblowers and analysts shows that specific AI models can be prompted to output the test sets for at least two popular benchmarks — MMLU and GSM8K. At a minimum, they say, this indicates data contamination and calls into question the veracity of each model’s benchmark scores. In the worst case, it could indicate widespread deceit in the corporate AI sector.
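The underlying idea is simple: if a model can reproduce long verbatim spans of a benchmark’s held-out test items, those items were probably in its training data. Below is a minimal sketch of such a regurgitation check — the function names, the prompt-completion setup, and the 40-character threshold are all illustrative assumptions, not the analysts’ actual methodology.

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest verbatim character run shared by two strings
    (simple O(len(a) * len(b)) dynamic programming)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def looks_contaminated(completion: str, held_out_item: str, threshold: int = 40) -> bool:
    """Flag a model completion that reproduces a long verbatim span of a
    benchmark test item. The threshold is an arbitrary illustrative choice:
    long exact matches are unlikely unless the item was memorized."""
    return longest_common_substring(completion, held_out_item) >= threshold


# Hypothetical example: a GSM8K-style item and a suspiciously exact completion.
item = ("Natalia sold clips to 48 of her friends in April, and then she sold "
        "half as many clips in May. How many clips did Natalia sell altogether?")
exact_completion = "clips to 48 of her friends in April, and then she sold half as many"
paraphrase = "She sold 48 clips, then 24 more, so 72 in total."

print(looks_contaminated(exact_completion, item))  # verbatim span -> flagged
print(looks_contaminated(paraphrase, item))        # paraphrase -> not flagged
```

In practice, analysts would run a probe like this across many test items and compare match rates against items the model could not have seen, but the core signal is the same: improbably long exact reproductions.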
