The Anatomy of an Agent Harness
Most engineers building AI agents spend 80% of their time thinking about which model to use and 20% thinking about everything else. That ratio should be flipped. The model is almost interchangeable at this point — the harness is what determines whether your agent actually works in production.
The equation is simple: Agent = Model + Harness. If you're not building the model, you're building the harness. And the harness is where nearly all the real engineering lives.
A harness is the totality of code, configuration, and execution logic that surrounds a model. It handles state, tools, memory, execution, context windows, error recovery, and everything else a model can't do on its own. Models are stateless text transformers. Harnesses are what make them into agents.
What a Model Can't Do Alone
To understand why harnesses matter, start with what models fundamentally cannot do:
- Maintain state between calls
- Execute arbitrary code
- Access information from after their training cutoff
- Manage their own context window
- Decide when they're done
These aren't temporary limitations waiting for the next model release. They're architectural facts. Even dramatically more capable models will need harnesses that handle filesystem access, tool dispatch, persistence, and lifecycle management. The harness isn't scaffolding to be discarded later — it's a permanent layer of the system.
This is why 2025 is increasingly seen as the year agents proved they could work, and 2026 is the year engineering teams are figuring out how to make them work reliably. That reliability gap lives almost entirely in the harness.
The Six Core Components
A production-grade agent harness has six functional areas. Miss any of them and your agent will fail in specific, predictable ways.
1. Filesystems and Storage
The filesystem is the harness's primary mechanism for durability. Without it, every agent run starts from scratch. With it, agents can:
- Persist state across sessions and context window resets
- Offload large artifacts out of the context window (write to disk, reference by path)
- Coordinate with other agents — the filesystem is the collaboration surface for multi-agent teams
- Implement memory through structured files that get injected into future prompts
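The last point can be sketched concretely. This is a minimal illustration, not a prescribed design: the file name `agent_state.json` and the state schema are assumptions, and a real harness would likely use a database or a structured memory directory.

```python
import json
from pathlib import Path

STATE_FILE = Path("agent_state.json")  # hypothetical location for durable state

def save_state(state: dict) -> None:
    """Persist agent state so a later run (or a fresh context window) can resume."""
    STATE_FILE.write_text(json.dumps(state, indent=2))

def load_state() -> dict:
    """Load prior state from disk, or start fresh if none exists."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"task": None, "completed_steps": []}

def build_prompt(task: str) -> str:
    """Inject persisted state into the prompt for the next model call.
    The model never 'remembers' anything; the harness reloads it."""
    state = load_state()
    return f"Task: {task}\nPreviously completed steps: {state['completed_steps']}"
```

The point of the sketch is the shape: state lives on disk, and every model call is reconstructed from it rather than from the model's own (nonexistent) memory.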
Teams that skip durable storage end up rebuilding it under pressure when users complain that their agent "forgot" what it was doing. Build it in from the start.
2. Code Execution and Bash Tools
General-purpose code execution is one of the highest-leverage capabilities you can give an agent. When agents can run code, they shift from being confined to pre-configured tools to being able to design solutions on the fly.
An agent with a bash tool can:
- Write and run its own diagnostic scripts
- Transform data formats it wasn't explicitly programmed to handle
- Verify its own output by running assertions
- Install dependencies as needed
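A bash tool at its simplest is a thin wrapper around a subprocess call that returns structured output the model can reason about. A minimal sketch, assuming a 30-second timeout and output truncation limits chosen arbitrarily for illustration:

```python
import subprocess

def bash_tool(command: str, timeout: int = 30) -> dict:
    """Run a shell command and return structured output for the model.
    Truncating stdout/stderr protects the context window from huge outputs."""
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
        return {
            "stdout": result.stdout[-4000:],
            "stderr": result.stderr[-4000:],
            "exit_code": result.returncode,
        }
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": f"timed out after {timeout}s", "exit_code": -1}
```

Returning the exit code explicitly matters: it lets the model distinguish "command ran and failed" from "command succeeded", which is the raw material for self-verification loops.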
The practical upside is enormous. The practical downside is that you've given the agent a loaded weapon. Which brings us to the next component.
3. Sandboxes and Execution Environments
Sandboxing is the hardest unsolved problem in production agent deployment. Most teams underestimate it.
The naive approach is to run agent code on the host machine. This works fine until an agent runs rm -rf somewhere it shouldn't, or makes network calls to services you didn't intend, or installs a package that conflicts with your production environment.
A proper sandbox provides:
- Isolation: Filesystem and network access scoped to what the agent actually needs
- Pre-installed dependencies: The agent shouldn't need to bootstrap its environment on every run
- Logging and observability: Every action the agent takes should be inspectable
- Verification loops: Sandboxes should include test runners and linters so agents can check their own work
The key insight is that sandboxing isn't about constraining the agent — it's about giving the agent a safe surface to operate on autonomously. An agent in a well-designed sandbox can be more aggressive and capable, not less, because the blast radius of mistakes is contained.
Production-grade sandboxes that spin up in milliseconds now exist specifically for this purpose. If you're running agent code directly on your production infrastructure, you have a sandboxing problem you haven't discovered yet.
4. Memory and Knowledge Systems
There are two distinct memory problems in agent harnesses:
Long-term memory — what the agent should retain across sessions. This gets implemented as files, databases, or vector stores that the harness injects into context at the start of each run. The agent doesn't "remember" things natively; the harness loads its memories for it.
Real-time knowledge — what the agent needs to know that postdates its training. This gets implemented through web search tools, MCP (Model Context Protocol) integrations, and retrieval systems. Without this, your agent is operating on a snapshot of the world from months or years ago.
Both are essential. An agent with long-term memory but no real-time knowledge will have excellent recall but stale information. An agent with real-time knowledge but no long-term memory will be sharp but amnesiac. Most early agent implementations punt on long-term memory entirely, then wonder why users feel like they're starting over every session.
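The two memory problems meet at prompt-assembly time. A minimal sketch of that seam, assuming a hypothetical `memory.md` file as the long-term store and a list of already-fetched search results standing in for a real retrieval or web-search tool:

```python
from pathlib import Path

MEMORY_FILE = Path("memory.md")  # hypothetical long-term memory store

def assemble_context(task: str, search_results: list[str]) -> str:
    """Combine long-term memory (loaded from disk by the harness) with
    real-time knowledge (fresh retrieval results) into one prompt."""
    memory = MEMORY_FILE.read_text() if MEMORY_FILE.exists() else "(no prior memory)"
    fresh = "\n".join(f"- {r}" for r in search_results) or "- (no results)"
    return (
        f"## Long-term memory\n{memory}\n\n"
        f"## Real-time knowledge\n{fresh}\n\n"
        f"## Task\n{task}"
    )
```

Notice that the model sees both kinds of memory as plain text in the prompt; the distinction between "remembered" and "retrieved" exists only in the harness.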
5. Context Management
Context windows are finite resources. Naive implementations fill them up and then break. Good harnesses treat context as a first-class engineering concern.
The main strategies:
Compaction: When the context window fills, summarize the conversation so far and start a fresh context with that summary. The model gets a clean window while key information is preserved.
Tool output offloading: Tool results — especially large ones like file reads or API responses — can be written to disk instead of inserted into context. The agent gets a reference to the file rather than the full content, which it can read when needed.
Progressive disclosure: Don't load all available tools into every prompt. Load only the tools relevant to the current task; loading 50 tools when you need 5 adds noise that measurably degrades output quality.
Context budget tracking: Build awareness of how much context you've consumed and trigger compaction or offloading before hitting the limit, not after.
The teams getting this right treat context as a budget — not a bin to fill until it overflows.
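Budget tracking and compaction compose naturally. A minimal sketch, assuming a crude 4-characters-per-token estimate (a real harness would use the model's tokenizer), an arbitrary 80% trigger threshold, and a caller-supplied `summarize` function standing in for a summarization model call:

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate (~4 chars per token); use the model's
    actual tokenizer in a real harness."""
    return len(text) // 4

def maybe_compact(messages: list[str], budget: int, summarize) -> list[str]:
    """If the conversation nears the budget, replace older messages with
    a summary plus the most recent message. Triggers at 80% of budget,
    before the limit, not after."""
    used = sum(estimate_tokens(m) for m in messages)
    if used < int(budget * 0.8):
        return messages
    summary = summarize(messages[:-1])
    return [f"Summary of earlier conversation: {summary}", messages[-1]]
```

The design choice worth noting is the early trigger: compacting at 80% leaves headroom for the summary itself and the next tool result, rather than discovering the overflow mid-call.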
6. Long-Horizon Execution Patterns
Single-shot tasks are mostly solved. The frontier is tasks that require hundreds of steps across multiple context windows.
The reliability math here is unforgiving. If each step in a multi-step pipeline succeeds 95% of the time, a 20-step task succeeds only 36% of the time end-to-end. At 50 steps, end-to-end success drops below 8%. Improving individual step reliability from 95% to 99% gets you to 82% at 20 steps, more than a 2x improvement without changing the task structure at all.
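The arithmetic above follows from treating steps as independent and can be checked directly:

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """With independent steps, end-to-end success is per_step ** steps."""
    return per_step ** steps

print(round(end_to_end_success(0.95, 20), 2))   # → 0.36
print(round(end_to_end_success(0.95, 50), 3))   # → 0.077
print(round(end_to_end_success(0.99, 20), 2))   # → 0.82
```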
This makes per-step error recovery and retry logic load-bearing infrastructure, not an afterthought.
One pattern worth naming explicitly: the loop reinject pattern. When a long-running task requires more context than any single window can hold, the harness reinjects the original task prompt at the start of each new context window, along with the agent's filesystem-based state tracker. The agent always knows where it started and where it is. Context is ephemeral; the filesystem is the source of truth.
This pattern makes infinite-length tasks tractable. It also makes the harness's state management design the binding constraint on what tasks are possible.
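The loop reinject pattern reduces to a small amount of harness code. A sketch under stated assumptions: the task string and `progress.json` tracker file are hypothetical, and `call_model` stands in for whatever model-invocation function the harness uses.

```python
from pathlib import Path

TASK_PROMPT = "Migrate all modules to the new API."  # hypothetical original task
PROGRESS_FILE = Path("progress.json")  # filesystem is the source of truth

def start_new_window(call_model) -> str:
    """Begin each fresh context window by reinjecting the original task
    prompt plus the on-disk progress tracker. Context is ephemeral; the
    agent always knows where it started and where it is."""
    progress = PROGRESS_FILE.read_text() if PROGRESS_FILE.exists() else "{}"
    prompt = (
        f"Original task: {TASK_PROMPT}\n"
        f"Progress so far (from disk): {progress}\n"
        "Continue from where the tracker leaves off."
    )
    return call_model(prompt)
```

Each new window is bootstrapped identically, which is what makes the number of windows, and therefore the task length, unbounded in principle.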
The IMPACT Framework
A useful mental model for ensuring your harness covers all the bases is the IMPACT framework:
- Intent: How does the agent understand and maintain its goal across a long task?
- Memory: What does the agent remember and how?
- Planning: How does the agent decompose tasks and sequence actions?
- Authority: What permissions does the agent have and how are they enforced?
- Control Flow: How does the agent loop, branch, and handle errors?
- Tools: What actions can the agent take in the world?
Any harness that can't give a clear answer to all six questions has gaps that will surface as production failures.
The Reliability Multiplier
Here's why harness engineering is underrated: model quality has logarithmic returns in practice, but harness quality has compounding effects.
A 10% improvement in raw model capability might get you a 5% lift on your benchmark. A 10% improvement in per-step reliability across a 20-step workflow can more than double your end-to-end success rate.
Teams that focus exclusively on model selection while shipping a fragile harness are optimizing the wrong variable. The harness is where reliability is built or lost.
Multi-agent architectures amplify this further. When orchestrators delegate to specialized sub-agents, each with isolated context windows focused on specific tasks, the system as a whole can tackle problems that would overwhelm any single agent. But that architecture only works if each agent's harness is solid — a flaky sub-agent harness propagates failures up the chain.
Engineering Backwards from Behavior
The right way to design a harness is to start with the behavior you want and work backwards to the components you need.
If you need an agent that can handle tasks spanning days, you need long-term memory and a loop reinject pattern. If you need an agent that can modify codebases safely, you need a sandbox with test runners. If you need an agent that stays current, you need real-time knowledge tools. If you need multiple agents collaborating, you need a shared filesystem as the coordination layer.
Don't add infrastructure speculatively. But do map your intended use cases to the six core components before you ship, and be honest about which gaps you're accepting.
Looking Forward
As models improve, some harness functionality will migrate into the model itself. Future models may handle longer contexts natively, reducing the need for aggressive compaction. Better instruction following may reduce the failure rate on individual steps.
But the core harness functions — sandboxed execution, durable state, real-time knowledge, explicit memory management — will remain engineering concerns for the foreseeable future. The model handles reasoning; the harness handles everything that makes reasoning useful in the real world.
The engineers who figure this out early are building systems that their peers will spend 2026 trying to replicate. The harness is the moat.
