The third iteration of the ARC-AGI benchmark has delivered a sobering reality check for the AI industry. Every single frontier model tested scored below 1% on a suite of tasks that untrained humans solve on their first attempt. The benchmark, released this week by the ARC Prize Foundation, represents the most rigorous evaluation yet of whether current AI systems possess anything approaching general intelligence.

The results are striking in their uniformity across competing models. Google's Gemini 3.1 Pro achieved the highest score among frontier models at 0.37%, while other leading systems fared even worse. Elon Musk's Grok 4.20 scored 0%, failing to solve even a single benchmark task. The gap between human and AI performance on these tasks is measured not in percentage points but in orders of magnitude.

"True AGI should not need task-specific human guidance."
— ARC Prize Foundation, Benchmark Design

The benchmark's 3,000+ visual reasoning tasks are deliberately designed to resist the task-specific optimizations that have driven AI performance gains elsewhere; each one demands genuine transfer learning and abstract reasoning. By barring task-specific guidance and training data, ARC-AGI-3 directly challenges the assumption that modern large language models are on a path toward artificial general intelligence.
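The scoring rules help explain why the reported numbers sit so close to zero. The sketch below is a minimal, hypothetical grader, assuming the grid-pair JSON layout used by earlier ARC-AGI releases (each task file holds "train" demonstration pairs and "test" pairs); the names `score_solver` and `solve` are illustrative, not part of any official evaluation harness. A task counts as solved only if every test output grid is reproduced exactly, so a near-miss scores the same as no answer at all.

```python
import json
from pathlib import Path


def score_solver(task_dir: Path, solve) -> float:
    """Score a solver on ARC-style tasks under exact-match grading.

    Assumes the grid-pair JSON layout of earlier ARC-AGI releases:
    each task file contains "train" demonstration pairs and "test"
    pairs, where every pair is an {"input": grid, "output": grid} dict.
    `solve` is any callable taking (train_pairs, test_input) and
    returning a predicted output grid.
    """
    solved = total = 0
    for path in sorted(task_dir.glob("*.json")):
        task = json.loads(path.read_text())
        total += 1
        # The solver sees only the demonstration pairs and the test inputs.
        predictions = [
            solve(task["train"], pair["input"]) for pair in task["test"]
        ]
        # Exact match on every test grid, or the task is counted as failed.
        if all(
            pred == pair["output"]
            for pred, pair in zip(predictions, task["test"])
        ):
            solved += 1
    return solved / total if total else 0.0
```

Under this kind of all-or-nothing grading, `score_solver(Path("tasks"), my_solver)` returns the fraction of tasks solved outright, which is why partial pattern-matching progress does not show up as partial credit in the headline percentages.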

With a $2M prize pool for any AI system that matches untrained human performance, the Foundation has essentially issued an ultimatum to the industry: prove your systems can think, not just pattern-match. So far, no takers. The benchmark's results suggest that despite remarkable progress in narrow domains, frontier AI models remain fundamentally unable to generalize in the ways that distinguish human reasoning. For researchers working on AGI alignment and safety, this gap offers both reassurance and a renewed sense of urgency.

Key figures:
- All frontier model scores: below 1%
- Gemini 3.1 Pro (best frontier result): 0.37%
- Grok 4.20: 0%
- Prize pool: $2M