LLMs & GenAI

OpenAI Extends Responses API With Shell Tool, Agent Skills, and Server-Side Compaction

4 min read850 words4 sources

Responses API now enables production-grade autonomous agents with shell tools, skills as folder bundles, and multi-hour session management.

OpenAI announced significant extensions to the Responses API, positioning it for production-grade autonomous agent development. The new features include a shell tool for direct system access, agent skills bundled as SKILL.md folders, and server-side compaction for multi-hour sessions.

The shell tool provides agents with a Debian 12 environment including Python 3.11, Node 22, Java 17, Go 1.23, and Ruby 3.1. This environment access enables agents to execute real work on computers, moving beyond language-only interactions to actual system operations.

Skills represent a significant architectural improvement. Rather than embedding agent capabilities as code snippets, developers now bundle skills as SKILL.md folders containing all necessary configuration, documentation, and supporting files. This modular approach improves code organization and skill reusability across projects.

Server-side compaction solves a critical problem for long-running agents. Multi-hour sessions accumulate token overhead from previous conversations. Compaction automatically condensenses conversation history while retaining essential context, enabling agents to handle extended operations without prohibitive cost growth.

"Long-running agents get dramatically more useful when they can both follow procedures and do real work on a computer." — OpenAI Developers

Real-world implementations demonstrate the capabilities. Triple Whale's Moby agent handled a 5 million token session with 150 tool calls, processing real business data across an extended conversation. Glean improved its search accuracy from 73 percent to 85 percent by leveraging the expanded API capabilities.

These features directly address limitations that constrained earlier agent deployments. Before, agents were limited to thinking and writing—valuable but incomplete. Now they can observe system state, take actions, and adapt based on real-world feedback.

The compaction feature removes a significant scaling bottleneck. Organizations deploying agents for long-running tasks no longer face exponential cost growth from accumulated context. This makes economically viable use cases like 24/7 monitoring, iterative research, and extended customer interactions practical.

5M
Tokens in Session
150
Tool Calls
18.1%
TTFT Reduction
20%
Accuracy Improvement

Developers adopting these features should focus on error handling and monitoring. Agents with system access require robust logging and alerts when operations fail or behave unexpectedly. Clear failure modes prevent cascading errors that could compound across long sessions.

The skills bundling approach particularly benefits teams building shared agent infrastructure. Teams can now maintain a library of production-tested skills with clear interfaces and documentation, reducing friction when multiple teams deploy agents for different purposes.

Sources