Why Your Agent Harness Should Be Stateless: Decoupling Brain from Hands in Production

9 min read
Tian Pan
Software Engineer

Most teams building AI agents treat the harness — the scaffolding that handles tool routing, context management, and the inference loop — as a long-lived, stateful process tied to a single container. When the container fails, the session dies. When you need to swap in a better model, you have to restart everything. When you want to scale horizontally, you hit a wall: each harness instance knows too much about its own state to be interchangeable.

The fix isn't a smarter harness. It's a stateless one.

Agents that have run reliably in production for months share a common structural pattern: they separate what thinks from what acts, and they put all durable state outside both. The harness should be as replaceable as a stateless HTTP handler. The sandbox where code runs should know nothing about the agent session it's serving. And the session log — the record of every event that's happened — should outlive both.

This sounds like obvious engineering hygiene. In practice, almost nobody does it at first.

The Tight Coupling That Makes Agents Fragile

When you build your first agent, the natural structure is monolithic: the inference loop, tool execution, and session state all live in the same process. This works fine for short-lived tasks. It breaks down badly for anything long-running.

Here's what failure looks like. An agent is 40 minutes into a complex coding task. The container running the harness is restarted due to a routine deployment. The session is gone. The agent had no way to resume because the session state existed only in memory. The only option is to start over — which means re-running the expensive early steps, and hoping the agent converges to the same intermediate state.

The deeper problem is that the harness encodes assumptions about model capabilities. As models improve, teams find themselves needing to update harness behavior — how much context to include, which tool calls to batch, how to handle errors — but the harness is baked into a running system. Changing it risks disrupting active sessions.

This is the "pets vs. cattle" problem applied to AI infrastructure. A pet is a named, individually tended server you can't afford to lose. Cattle are interchangeable and replaceable. Traditional agent harnesses are pets. Production systems need cattle.

The Virtualization Pattern

The solution follows a design principle borrowed from operating systems: virtualize the components to allow independent replacement.

Three things need to be separated:

The session is an append-only event log that lives outside the harness and outside the sandbox. Every user message, model response, tool call, tool result, and error gets recorded as a structured event. The session is the single source of truth. If the harness restarts, it fetches the session log and picks up exactly where it left off. If you want to replay the session with a different model, you can — the event log is model-agnostic.
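The append-only session can be sketched as a minimal event store. This is an illustrative sketch, not the article's actual implementation: the `Event` and `SessionStore` names are invented, and a real deployment would back this with a database rather than an in-memory dict.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Event:
    """One structured entry in the append-only session log."""
    kind: str      # e.g. "user_message", "model_response", "tool_call"
    payload: dict
    ts: float = field(default_factory=time.time)

class SessionStore:
    """Append-only log, keyed by session ID. In-memory here; a database
    table in production. Events are never mutated or deleted."""
    def __init__(self):
        self._logs: dict[str, list[Event]] = {}

    def append(self, session_id: str, event: Event) -> int:
        log = self._logs.setdefault(session_id, [])
        log.append(event)
        return len(log) - 1  # position of the new event

    def read(self, session_id: str, since: int = 0) -> list[Event]:
        """Fetch events from a position onward; never mutates the log."""
        return self._logs.get(session_id, [])[since:]

store = SessionStore()
store.append("s1", Event("user_message", {"text": "fix the bug"}))
store.append("s1", Event("tool_call", {"name": "read_file", "input": "main.py"}))
events = store.read("s1")
```

Because the log is model-agnostic structured data, replaying it under a different model is just another `read` followed by a different harness.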

The harness is the inference loop. It fetches pending events from the session, constructs a working context window (selecting which events are relevant and how to present them), calls the model, and routes tool invocations back to the session. Critically, the harness holds no durable state of its own. Any harness instance can handle any session. This makes horizontal scaling trivial: add more harness instances and they can service sessions round-robin.
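One turn of such a harness can be written as a pure function over injected dependencies, which is what makes any instance interchangeable. The function name and parameter shapes below are assumptions for illustration, not a real API:

```python
from typing import Callable

def run_turn(
    session_id: str,
    fetch_events: Callable[[str], list[dict]],
    build_context: Callable[[list[dict]], str],
    call_model: Callable[[str], str],
    append_event: Callable[[str, dict], None],
) -> str:
    """One harness turn. No state survives between invocations, so any
    harness instance can serve any session."""
    events = fetch_events(session_id)   # durable state lives outside
    prompt = build_context(events)      # select and render relevant events
    response = call_model(prompt)       # the only model-specific step
    append_event(session_id, {"kind": "model_response", "text": response})
    return response

# Toy wiring: an in-memory log stands in for the external session store.
log = {"s1": [{"kind": "user_message", "text": "hello"}]}
reply = run_turn(
    "s1",
    fetch_events=lambda sid: log[sid],
    build_context=lambda evs: "\n".join(e["text"] for e in evs),
    call_model=lambda prompt: f"echo: {prompt}",
    append_event=lambda sid, e: log[sid].append(e),
)
```

Horizontal scaling then needs no session affinity: any instance that can reach the store can run `run_turn` for any session.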

The sandbox is the execution environment where the agent's tool calls actually run — file operations, code execution, shell commands. The sandbox is completely isolated from the harness. It doesn't know which model is calling it. It receives a tool invocation (execute(name, input) → string) and returns a result. That's the entire interface.
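That single-method interface can be captured as a protocol. The sketch below is hypothetical; a production sandbox would run in an isolated container or VM rather than in-process, but the harness-facing surface would stay this narrow:

```python
from typing import Protocol

class Sandbox(Protocol):
    """The entire harness-facing surface: execute(name, input) -> string.
    The sandbox knows nothing about the session or the model calling it."""
    def execute(self, name: str, input: str) -> str: ...

class LocalSandbox:
    """Toy in-process implementation used to illustrate the interface."""
    def execute(self, name: str, input: str) -> str:
        if name == "echo":
            return input
        if name == "upper":
            return input.upper()
        return f"unknown tool: {name}"

sb: Sandbox = LocalSandbox()
result = sb.execute("upper", "hello")
```

Keeping the interface this small is what lets the sandbox implementation be swapped (local process, container, remote VM) without the harness noticing.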

This separation produces a useful property: components fail independently. If the sandbox crashes, the harness can spin up a new one and resume from the last logged event. If you want to upgrade the harness to handle a new tool type, active sessions aren't affected. If you need to switch models, you swap the harness; the session history persists.

What This Does to Latency

The performance impact of this decoupling is larger than it might appear, and it comes from a specific optimization that's only possible when the harness and sandbox are separate.

In the coupled architecture, the harness starts inference after the sandbox (container) is provisioned. Container provisioning can take several seconds even with warm pools. This adds latency to every session start — and at the tail of the distribution (p95, p99), where provisioning is slow and queues are long, the impact is severe.

When harness and sandbox are decoupled, inference can start as soon as the harness pulls pending events from the session log. Sandbox provisioning happens on-demand, only when the model actually invokes a tool that requires one. A session that only needs to read from memory or query a fast API never provisions a sandbox at all.
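The on-demand provisioning described above amounts to a lazy wrapper around the sandbox. A minimal sketch, assuming a `provision` callable that stands in for the expensive container startup:

```python
class LazySandbox:
    """Defers provisioning until a tool call actually needs a sandbox, so
    sessions that never execute code never pay the startup cost."""
    def __init__(self, provision):
        self._provision = provision  # e.g. container startup, seconds-slow
        self._sandbox = None

    @property
    def provisioned(self) -> bool:
        return self._sandbox is not None

    def execute(self, name: str, input: str) -> str:
        if self._sandbox is None:
            self._sandbox = self._provision()  # only on first tool call
        return self._sandbox.execute(name, input)

# Inference can begin immediately; the sandbox exists only if invoked.
calls = []
def slow_provision():
    calls.append("provisioned")
    class Stub:
        def execute(self, name, input): return f"{name}:{input}"
    return Stub()

lazy = LazySandbox(slow_provision)
before = lazy.provisioned        # still False while the model reasons
out = lazy.execute("run", "x")   # provisioning happens here, once
```

The time-to-first-token win comes precisely from `before` being `False`: the model is already generating while no container exists.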

The result is a dramatic reduction in time-to-first-token — in the range of 60% at the median and over 90% at the 95th percentile. For interactive agent use cases, where users are waiting for the first response, this difference is perceptible.

Context Management as a Query Problem

One underrated benefit of externalizing the session is what it does for context management.

In a coupled architecture, context management is a destructive operation. You decide which events to include in the context window and which to drop, and that decision is permanent — the events you excluded don't live anywhere else. This means you're making irreversible bets about what will matter later in the session.

When the session is an external event log, context management becomes a query problem. The harness fetches a view of the session — positional slices, filtered by event type, summarized or verbatim — without modifying the underlying log. If a later turn needs information from earlier in the session, the harness can fetch it. The raw events are always there.
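Such a view can be expressed as a small non-destructive query function. The `view` helper and its parameters are illustrative, not the article's API:

```python
def view(events, *, kinds=None, last=None):
    """Non-destructive query over the session log: filter by event type,
    then take a trailing positional slice. The log itself is untouched."""
    selected = [e for e in events if kinds is None or e["kind"] in kinds]
    return selected[-last:] if last else selected

log = [
    {"kind": "user_message", "text": "start"},
    {"kind": "tool_result", "text": "big output 1"},
    {"kind": "tool_result", "text": "big output 2"},
    {"kind": "model_response", "text": "done"},
]

# Two different context strategies over the same immutable log:
recent = view(log, last=2)
no_tool_noise = view(log, kinds={"user_message", "model_response"})
```

Swapping between strategies changes only the query; every excluded event is still in `log`, available to a later turn.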

This decouples the storage schema from the prompt format. You can experiment with different context compression strategies without touching the event log. You can change how events are rendered into prompts without losing any history. You can write tooling that retrospectively analyzes session logs without worrying about what the harness decided to include.

Practically, this means you can evolve your context engineering approach — which turns out to be one of the most frequently iterated parts of a production agent — without being constrained by what you decided to store.

Security at the Execution Boundary

The sandbox isolation has security implications beyond reliability.

When the harness and sandbox are the same process, credentials used by the agent (OAuth tokens, API keys, git access tokens) necessarily live in the same memory space as the code the agent generates and executes. This creates an attack surface: malicious instructions embedded in tool outputs or web content — prompt injection — can potentially exfiltrate those credentials by instructing the model to write code that reads from memory or environment variables and sends it somewhere.

When the sandbox is isolated and communicates only through a narrow tool interface, the attack surface shrinks significantly. Credentials never reach the execution environment. Instead, they live in a secure vault that the harness accesses through a proxy. The proxy handles authentication and injects credentials only at the network boundary, for specific outbound requests. The sandbox never sees them.

For git operations, this means access tokens are bundled during repo initialization (before the sandbox exists) rather than passed as environment variables. For API calls, the proxy intercepts requests from the sandbox and attaches the appropriate credentials based on the destination, without the agent knowing the actual token values.
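The credential-injecting proxy can be sketched as follows. This is a simplified illustration under invented names (`VAULT`, `proxy_request`); a real proxy would terminate TLS and talk to an actual secrets vault, but the core invariant is the same — the token is attached per destination at the boundary, never inside the sandbox:

```python
# Hypothetical vault mapping destination hosts to secrets. These values
# live with the proxy; the sandbox process never holds them.
VAULT = {"api.example.com": "secret-token-123"}

def proxy_request(host: str, path: str, headers: dict) -> dict:
    """Attach the right credential based on destination, at the network
    boundary, so code generated by the agent never sees token values."""
    token = VAULT.get(host)
    outbound = dict(headers)  # copy: never mutate the sandbox's request
    if token is not None:
        outbound["Authorization"] = f"Bearer {token}"
    return {"host": host, "path": path, "headers": outbound}

# Request as emitted from the sandbox: no credentials present.
sandbox_request = {"host": "api.example.com", "path": "/v1/data", "headers": {}}
sent = proxy_request(sandbox_request["host"], sandbox_request["path"],
                     sandbox_request["headers"])
```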

This pattern doesn't eliminate prompt injection risks — a compromised agent can still take harmful actions within its scope — but it prevents credential theft as an attack outcome, which is often the highest-value target.

Building Toward Replaceability

The deeper reason to adopt this architecture is resilience against model capability changes.

Agent harnesses written in 2023 encoded assumptions that made sense at the time: models needed explicit state tracking because they couldn't maintain coherent context across many turns; tool calls had to be formatted in specific ways because models didn't reliably follow free-form instructions; error recovery had to be scripted because models couldn't reason about failures. Many of those assumptions are now wrong for current models, and will be wrong in different ways for future models.

A harness baked into a stateful, long-running process is hard to update without disrupting sessions. A stateless harness is just a function: events in, model call out, results logged. You can replace that function without touching the session log. You can run two harness implementations in parallel — routing some sessions to the new one, keeping others on the old one — and compare behavior directly.
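Routing sessions between two harness implementations is straightforward once the harness is stateless. A minimal deterministic-rollout sketch (the function and its parameters are illustrative):

```python
import hashlib

def pick_harness(session_id: str, rollout_pct: int) -> str:
    """Deterministic percentage rollout: hash the session ID into a
    0-99 bucket, so the same session always hits the same harness and
    the two implementations can be compared on disjoint traffic."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "new" if bucket < rollout_pct else "old"
```

Because both implementations read from and append to the same session log, switching a session's route mid-stream is also safe: the new harness simply replays the log.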

This is the key property that operating system abstractions provide: the read() system call has the same interface whether it's reading from a 1970s disk pack or a modern SSD. The abstraction outlasts the implementations beneath it. A well-designed harness interface can outlast several generations of underlying model capability.

The architectural work to separate these components is real — it requires rethinking how state flows through the system and investing in external session storage. But the payoff is a system that can evolve as models improve, without requiring you to rebuild from scratch each time the model capabilities shift.

What to Build First

If you're starting a new agent system, the minimal viable implementation of this pattern is:

  • Persist every event to an external store (even a database table works) before processing it
  • Make the harness a pure function that takes a session ID, fetches events, runs inference, and appends results — with no in-memory state that persists between invocations
  • Run tool execution in a separate process that communicates via a narrow interface, and never pass credentials directly to that process
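The first two bullets reduce to a persist-before-process discipline. A minimal sketch, with `DurableLog` standing in for whatever external store you choose:

```python
class DurableLog:
    """Minimal external store; a database table works just as well."""
    def __init__(self):
        self.rows = []
    def append(self, session_id, event):
        self.rows.append((session_id, event))

def handle_event(session_id, event, store, process):
    """Persist-before-process: the event is durable before any work
    happens, so a crash mid-processing loses no state and a restarted
    harness can replay from the log."""
    store.append(session_id, event)  # step 1: make it durable
    return process(event)            # step 2: act on it

store = DurableLog()
out = handle_event("s1", {"kind": "user_message", "text": "hi"},
                   store, process=lambda e: e["text"].upper())
```

The ordering is the whole point: reversing the two steps reintroduces the restart-loses-everything failure mode described earlier.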

You don't need to get the sandbox isolation perfect on day one. But getting the session externalized early prevents the class of failures where a restart loses everything — and it makes the rest of the architecture much easier to evolve.

The harness that hardcodes assumptions about the current model will be your biggest liability as models improve. Build it to be replaced.
