For all the breathless talk of artificial intelligence replacing researchers, a sobering new assessment published by Nature on April 13 delivers a blunt verdict: human scientists still trounce the best AI agents when it comes to the complex, messy work of real scientific discovery. The finding, drawn from the 2026 Stanford AI Index Report and corroborated by multiple independent benchmarks, arrives at a moment when labs and universities are racing to deploy autonomous AI agents across every stage of the research pipeline.
The Nature analysis, a state-of-the-industry report synthesizing data from Stanford's Human-Centered AI Institute and peer-reviewed evaluations, found that on the most demanding scientific benchmarks, the best AI agents perform only about half as well as experts with PhDs. That gap persists even as frontier models have conquered narrower academic tests -- passing PhD-level multiple-choice exams and earning gold medals at the International Mathematical Olympiad.
The Numbers Tell the Story
The disconnect between exam performance and practical scientific ability is stark. On DiscoveryWorld, a benchmark developed by Peter Jansen at the Allen Institute for AI (Ai2) that requires agents to form hypotheses, design experiments, execute them, and analyze results, the best AI systems complete only around 20% of tasks at higher difficulty levels. Human scientists with advanced degrees solve, on average, roughly 70% of those same challenges -- a 50-percentage-point chasm.
On MLE-bench, which challenges AI systems with 75 real-world data science competitions drawn from Kaggle, the best-performing setup -- OpenAI's o1-preview with AIDE scaffolding -- achieved at least the level of a Kaggle bronze medal in just 16.9% of competitions. Meanwhile, on the OSWorld benchmark for autonomous computer tasks, AI agents jumped from 12% to approximately 66% task success over the past year, yet still fail roughly one in three attempts on structured tasks that humans handle routinely.
Even on Humanity's Last Exam, a collection of the hardest expert-level questions across disciplines, the best AI models top out around 35% accuracy while human domain experts average approximately 90% -- a gap of more than 50 percentage points.
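For readers who want to check the arithmetic behind these comparisons, the short Python sketch below recomputes the gaps from the rounded figures quoted above. It is only a cross-check of the numbers as reported here, not the methodology of Nature, the Stanford AI Index, or any benchmark author; the names QUOTED_SCORES and gap_points are illustrative assumptions.

```python
# Rounded figures as quoted in this article, in percent.
# Assumption: these are the article's approximations, not official benchmark data.
QUOTED_SCORES = {
    "DiscoveryWorld (higher difficulty)": {"ai": 20.0, "human": 70.0},
    "Humanity's Last Exam": {"ai": 35.0, "human": 90.0},
}


def gap_points(ai: float, human: float) -> float:
    """Return the human-minus-AI gap in percentage points."""
    return human - ai


# Print the percentage-point gap for each benchmark with a quoted human baseline.
for name, scores in QUOTED_SCORES.items():
    gap = gap_points(scores["ai"], scores["human"])
    print(f"{name}: gap of {gap:.0f} percentage points")

# OSWorld: the quoted success rate implies the share of attempts that still fail.
osworld_success = 66.0
osworld_failure = 100.0 - osworld_success
print(f"OSWorld: about {osworld_failure:.0f}% of attempts fail")
```

Because the inputs are rounded, the results are approximate in the same way the article's figures are.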
The Jagged Frontier
Researchers describe the problem as a "jagged frontier" of AI capability. Models that can solve extraordinarily complex mathematics problems fail at surprisingly simple tasks: the top AI model reads analog clocks correctly only 50.1% of the time. This unpredictability is precisely what makes autonomous deployment in science risky.
"We generally lack measures of how well a system, or agent, needs to function in a particular setting," said Ray Perrault, co-director of the Stanford AI Index steering committee, cautioning that strong benchmark scores may not translate to real-world research effectiveness.
Peter Jansen of Ai2, whose DiscoveryWorld benchmark has been cited nearly 80 times since its 2024 release, put the gap in sharper terms. His research shows that AI agents possess impressive "book smarts" -- they can ace standardized science exams -- but lack the "street smarts" required for genuine scientific inquiry. Agents that earned A grades on ARC science exams failed over 90% of practical ScienceWorld tasks when that benchmark launched in 2022, though frontier models have since climbed to scores in the low 80s on that earlier test.
Adoption Outpaces Capability
Despite these limitations, researchers have enthusiastically embraced AI tools. The number of publications in the natural sciences mentioning AI grew almost 30-fold from 2010 to 2025, and AI-related computer science publications more than doubled over the past decade, from 102,000 to 258,000. Many scientists now rely on AI agents to autonomously carry out scientific workflows, even as the Nature report expresses skepticism about the agents' actual performance on complex tasks.
The Stanford report also documents a telling pattern in clinical AI research: of more than 500 reviewed clinical AI studies, nearly half relied on exam-style questions rather than real patient data, with only 5% using authentic clinical information. The finding suggests that the gap between AI performance on curated tests and messy real-world problems extends well beyond the laboratory bench.
What Comes Next
The picture is not entirely discouraging for AI proponents. Human-AI collaboration consistently outperforms either humans or AI working alone, according to the AgentDS technical report. The most successful approaches combine human strategic reasoning -- diagnosing modeling failures, injecting domain knowledge, making judgment calls about generalization -- with AI-assisted implementation that accelerates coding, experimentation, and iteration.
Stanford computer scientist James Zou captured the nuance: "AI excels at spotting gaps, but judgment calls still need humans."
For now, the message from the research community is clear. AI agents are powerful accelerators, but the irreplaceable core of scientific work -- the capacity to navigate ambiguity, exercise judgment under uncertainty, and connect disparate threads of knowledge into genuine insight -- remains a distinctly human advantage. The question is not whether AI will eventually close the gap, but how many premature claims of parity the field will have to walk back before it does.