benchmarks

| AI | Mar 09, 2026

Claude is wildly outperforming GPT and Gemini in BullshitBench. Its creator thinks some vendors may be losing touch with fundamentals.

| AI | Nov 04, 2025

"Reasoning models cannot determine when or how to revise what they have learned"

| agentic ai | Oct 20, 2025

The holistic method addresses gaps in AI agent benchmarks, such as cost and the methods used.

| microsoft | Oct 16, 2025

Open-source benchmark uses almost real-world data and a simulated SOC.

| AI | Sep 09, 2025

Trillion-dollar industry can't even measure its biggest problem -- "calibration" not the answer, says OpenAI

| microsoft | May 23, 2025

The AI exam could be used to inform new regulations.

| AI | Jan 13, 2025

Intentional? Or incidental to the nature of large-scale data scraping?