Nature Study Finds Human Scientists Still Outperform Best AI Agents on Complex Research Tasks

The best AI agents in the world score roughly half as well as human scientists with PhDs when tasked with complex, multi-step research workflows, according to a sweeping state-of-the-industry analysis published by Nature this month. The finding, drawn from the Stanford Institute for Human-Centered AI's 2026 AI Index Report, arrives at a moment when billions of dollars in venture capital are flowing toward the promise that AI agents will soon automate scientific discovery itself.

The gap is not marginal. On PaperArena, a benchmark that requires AI agents to reason across multiple scientific papers using tools like PDF parsers, database queries, and web searches, the top-performing multi-agent system built on Google's Gemini 2.5 Pro achieved just 38.8 percent accuracy. Human PhD experts scored 83.5 percent on the same tasks. On the hardest subset of problems, agent performance collapsed to 18.5 percent. On ReplicationBench, which tests whether AI systems can autonomously reproduce published astrophysics results from scratch, the best agents scored below 20 percent.

"The data does not point in a single direction," said Yolanda Gil, a computer scientist at the University of Southern California who co-chaired the Stanford HAI report. The observation captures the central paradox of AI in science today: the same frontier models that now match or exceed human performance on PhD-level multiple-choice science questions and competition mathematics cannot reliably chain together the six or seven sequential steps required to actually do research.

The Jagged Frontier of AI Capability

Researchers have begun calling this pattern "the jagged frontier" -- a landscape where AI systems demonstrate superhuman performance on narrow, well-defined tasks while failing unpredictably on broader workflows that require sustained reasoning, planning, and self-correction. The Nature analysis highlights a striking example: frontier AI models can win gold medals at the International Mathematical Olympiad but read analog clocks correctly only 50.1 percent of the time.

The implications for scientific research are profound. While an AI system might instantly answer a complex chemistry question drawn from a textbook, it struggles when asked to design an experiment, execute it computationally, interpret ambiguous results, and revise its approach -- the iterative cycle that defines real scientific work. On UnivEarth, a benchmark for earth observation research, LLM agents answered questions with just 33 percent accuracy, and their generated code failed 58 percent of the time.

"We don't know a lot of things about predicting model behaviors," Gil noted, underscoring that even the researchers building these systems cannot reliably forecast where they will succeed and where they will break down.

AI Adoption Is Surging Despite the Limitations

The performance gap has not slowed adoption. The Nature report found that more than 80,000 papers, preprints, and other publications in the natural sciences mentioned AI in 2025, a 26 percent increase over 2024. The number of publications in the life, physical, and earth sciences mentioning AI grew by a factor of nearly 30 between 2010 and 2025. Across natural science fields, between 6 and 9 percent of all publications now reference artificial intelligence.

A separate study published in Nature earlier this year, analyzing 41.3 million research papers, found that scientists who engage in AI-augmented research publish 3.02 times more papers, receive 4.84 times more citations, and become research project leaders 1.37 years earlier than their peers. But the same study identified a troubling tradeoff: as individual scientists become more productive with AI tools, the collective focus of science narrows. AI appears to be concentrating research attention rather than expanding it.

The paradox extends to AI agents more broadly. On OSWorld, a benchmark for general computer use, AI agents improved from a 12 percent task success rate to roughly 66 percent -- but that still means they fail one out of every three structured tasks. On Terminal-Bench, which measures real-world task completion, success rates climbed from 20 percent in 2025 to 77.3 percent. Progress is real, but reliability remains elusive.

What This Means for the Future of AI in Science

The Nature analysis challenges a narrative that has dominated AI industry discourse: that autonomous AI scientists are imminent and that human researchers will soon be replaced by agents capable of end-to-end discovery. The benchmarking data tells a more nuanced story. AI systems are powerful tools for accelerating specific subtasks -- literature review, data analysis, hypothesis generation -- but they cannot yet serve as independent researchers.

The distinction matters for funding decisions, laboratory design, and career planning across the sciences. If AI agents are collaborators rather than replacements, the optimal research environment looks very different from one designed around full automation. The benchmark results suggest that the most productive near-term investment is in human-AI teaming rather than autonomous agent development.

The Stanford HAI report also raises a governance question. As AI tools become embedded in the scientific process, their tendency to narrow research focus could have long-term consequences for the diversity of scientific inquiry. The tools that make individual scientists more productive might simultaneously make science as a whole less exploratory.

What to Watch

Three developments will determine how quickly the gap closes. First, whether next-generation reasoning models from OpenAI, Google, and Anthropic can maintain coherence across longer chains of scientific reasoning. Second, whether new benchmarks like PaperArena and ReplicationBench become standard evaluation tools that force AI labs to optimize for real scientific workflows rather than question-answering. And third, whether the narrowing effect on scientific focus intensifies as AI adoption grows, or whether researchers develop strategies to counteract it. The 2027 AI Index Report will be the first to capture a full year of data from the current generation of reasoning-capable agents, making it the definitive test of whether the gap between AI and human scientists is closing or merely shifting to new terrain.
