The Anatomy of an Agent Harness

· 8 min read
Tian Pan
Software Engineer

There's a 100-line Python agent that scores 74–76% on SWE-bench Verified — only 4–6 percentage points behind state-of-the-art systems built by well-funded teams. The execution loop itself isn't where the complexity lives. World-class teams invest six to twelve months building the infrastructure around that loop. That infrastructure has a name: the harness.

The formula is simple: Agent = Model + Harness. The model handles reasoning. The harness handles everything else — tool execution, context management, safety enforcement, error recovery, state persistence, and human-in-the-loop workflows. If you've been spending months optimizing prompts and model selection while shipping brittle agents, you've been optimizing the wrong thing.

What a Harness Is (and Is Not)

The term gets confused with adjacent concepts. Distinctions worth keeping:

Scaffolding assembles the agent before it runs — writing the system prompt, registering tool schemas, setting up the subagent registry. It's the setup phase.

Harness is the runtime. It orchestrates the execution loop, dispatches tool calls, manages context as the session grows, enforces safety checks, persists state between turns, and surfaces results. Scaffolding runs once; the harness runs continuously.

Frameworks (LangChain, LangGraph, smolagents) provide building blocks. A harness assembles them into an opinionated runtime with specific defaults for your use case. You can build a harness on top of a framework, or build one from scratch — the 100-line agent that benchmarks at 74% is entirely from scratch.

Orchestrators control agent logic — when to call models, how to route decisions. The harness provides the capabilities those orchestrators rely on: tools, memory, execution environment, permission enforcement.

The Execution Loop

Every production agent framework — OpenAI Agents SDK, Claude Agent SDK, LangGraph, Vercel AI SDK, smolagents — converges on the same core loop:

while not done:
    response = call_llm(messages)
    if response.tool_calls:
        results = execute_tools(response.tool_calls)
        messages.append(results)
    else:
        done = True
return response

This is Thought-Action-Observation (also called ReAct). The model reasons, requests a tool, gets the result, reasons again. Each cycle is a turn. The loop terminates when the model returns a response with no tool calls.

This core is simple. What makes it hard to run in production are the things the loop doesn't handle by default:

  • What happens when a tool call fails?
  • What happens when context fills up mid-task?
  • What happens when the agent requests a destructive action?
  • What happens when the task needs to pause for human review?
  • How do you attribute cost across thousands of concurrent sessions?

A harness answers all of these.

The Components

Tools

Tools are function schemas injected into context and dispatched by the harness when the model requests them. The harness handles registration, schema validation, sandboxed execution, and result formatting.

Here's the number that should change how you prioritize work: tool outputs comprise 67.6% of what agents actually see in context. System prompts account for 3.4%. Tool definitions, another 10.7%. If you've been iterating on your system prompt while ignoring tool output formatting, you've been optimizing 3% of your agent's context while neglecting 68%.

The implications are direct. Verbose tool outputs — full file contents when a summary would do, raw API responses with redundant fields — are the fastest path to context rot. Compact, structured tool outputs preserve working memory longer and cost less.
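As a sketch of that idea, a harness can compact every tool result before it enters context. The field allowlist and size cap below are hypothetical choices, not a prescribed API — the point is that the harness, not the tool, decides what the model sees:

```python
import json

def compact_tool_output(result: dict, max_chars: int = 2000) -> str:
    """Shrink a raw tool result before appending it to the message history."""
    # Hypothetical allowlist: keep only the fields the model acts on,
    # dropping redundant metadata from raw API responses.
    keep = {"status", "path", "summary", "error"}
    slim = {k: v for k, v in result.items() if k in keep}
    text = json.dumps(slim, separators=(",", ":"))
    if len(text) > max_chars:
        text = text[:max_chars] + "...[truncated]"
    return text
```

A harness that applies a filter like this at the dispatch boundary fixes output verbosity once, for every tool, instead of per-tool.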

Context Management

Context rot is real and measurable. Research consistently shows performance degrading 30%+ as useful content gets buried in the middle of a growing context window, even in models rated for million-token contexts. The degradation isn't linear — there's often a threshold around 256k tokens where reliability drops sharply regardless of the stated context limit.

Five strategies that work in production:

Compaction: Summarize conversation history periodically. Preserve architectural decisions, key findings, and open questions. Discard redundant tool outputs and intermediate reasoning that's already been acted on. One approach (ACON) demonstrated 26–54% token reduction while preserving 95%+ accuracy by prioritizing reasoning traces over raw outputs.

Observation masking: Keep the reasoning chain visible but hide old tool outputs. The model needs to know what it decided; it doesn't need to re-read the raw file it read three turns ago.

Just-in-time retrieval: Store lightweight identifiers (file paths, document IDs) rather than full content. Load data at execution time rather than pre-loading everything into context at session start.

Structured note-taking: Have the agent maintain external memory files — a progress log, a task list, a decisions file. This externalizes working memory to the filesystem, where it persists across sessions without consuming context.

Sub-agent delegation: Spawn specialized agents for subtasks. Each gets a fresh context window. Only their final 1,000–2,000 token summary returns to the parent. This is how you run tasks that would otherwise overflow any context limit.
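Observation masking, for instance, can be a pure function over the message history — the message shape and stub text here are assumptions for illustration:

```python
def mask_old_observations(messages: list[dict], keep_last: int = 3) -> list[dict]:
    """Hide stale tool outputs while keeping the reasoning chain intact.

    Assumes each message is a dict with a "role" key; tool messages older
    than the most recent `keep_last` get their content replaced by a stub.
    """
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = set(tool_idx[:-keep_last]) if keep_last else set(tool_idx)
    return [
        {**m, "content": "[output elided; re-run the tool if needed]"}
        if i in stale
        else m
        for i, m in enumerate(messages)
    ]
```

The model still sees that it read the file and what it decided; it just can't re-read three-turn-old raw output that no longer earns its token cost.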

Error Handling

The compounding-error math is sobering. A 10-step process with 99% per-step reliability has ~90% end-to-end success. At 20 steps: ~82%. At 50 steps: ~60%. This is why reliability engineering, not model quality, determines whether your agent ships.
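The arithmetic is easy to verify: end-to-end success for an n-step chain with per-step reliability p is p**n.

```python
# Compounding reliability: per-step success of 99% decays fast with depth.
for n in (10, 20, 50):
    print(f"{n} steps: {0.99 ** n:.0%} end-to-end success")
```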

Four error categories require different responses:

  • Transient errors (network timeouts, 503s): retry with exponential backoff and jitter
  • LLM-recoverable errors (wrong data types, malformed calls): return the error as a tool result so the model self-corrects on the next turn
  • User-fixable errors: interrupt the loop and surface the problem for human input
  • Fatal errors: bubble up immediately, don't retry

The worst harness anti-pattern is treating all errors identically — either retrying everything (infinite loops) or surfacing everything to the user (alert fatigue).
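A minimal sketch of differentiated routing — the exception classes are hypothetical, and `execute` is an assumed tool dispatcher passed in as a callback:

```python
import random
import time

class TransientError(Exception): ...
class LLMRecoverableError(Exception): ...
class UserFixableError(Exception): ...

def run_tool(execute, call, max_retries: int = 3):
    """Dispatch one tool call, routing each error category differently."""
    for attempt in range(max_retries):
        try:
            return execute(call)
        except TransientError:
            # Retry with exponential backoff plus jitter.
            time.sleep(0.1 * 2 ** attempt + random.random() * 0.05)
        except LLMRecoverableError as exc:
            # Return the error as a tool result so the model self-corrects.
            return {"role": "tool", "is_error": True, "content": str(exc)}
        except UserFixableError:
            raise  # interrupt the loop; surface the problem to the human
        # Any other exception is fatal: it propagates immediately, unretried.
    raise TransientError(f"gave up after {max_retries} attempts")
```

Note that the LLM-recoverable branch returns rather than raises: the error message itself becomes context for the model's next turn.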

Safety and Guardrails

A single safety layer is not sufficient when your agent can execute shell commands, write files, call external APIs, and spawn subprocesses. The approach that works in production is defense-in-depth:

  1. Prompt-level: Instructions for what the agent should never do, security policies for data handling
  2. Schema-level: Per-subagent tool filtering; plan-mode that produces proposals without executing them
  3. Runtime approval: Human-in-the-loop checkpoints before high-risk actions, with configurable automation levels
  4. Tool-level validation: Dangerous pattern detection before execution, stale-read detection to prevent clobbering concurrent changes
  5. Lifecycle hooks: User-defined scripts that run before/after specific operations and can block execution

The propose-and-commit pattern solves the hardest reliability problem: the agent proposes the action, a human (or a separate validation layer) commits it. The AI handles the "what" and "why"; the harness handles the "how" and "where."
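The shape of propose-and-commit can be sketched in a few lines — `Proposal`, `approve`, and `execute` are all hypothetical names, with the approval callback standing in for a human or a separate validation layer:

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    action: str                       # what the agent wants to do
    args: dict = field(default_factory=dict)
    risk: str = "low"                 # hypothetical "low" / "high" rating

def commit(proposal: Proposal, approve, execute) -> bool:
    """The model proposes; the harness decides whether anything runs."""
    if proposal.risk == "high" and not approve(proposal):
        return False                  # blocked before any side effect occurs
    execute(proposal)
    return True
```

The key property is that the side effect lives entirely behind the harness boundary: a rejected high-risk proposal never touches `execute`.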

State Persistence and Pause/Resume

Stateless agents are prototypes. Production tasks run longer than a single session — they get interrupted by context limits, network failures, and deliberate human checkpoints.

LangGraph checkpoints execution state at each super-step boundary. Claude Code uses git commits. The mechanism doesn't matter much; what matters is that the complete execution state is serializable and recoverable, so a task interrupted at step 47 of 50 can resume from step 47, not from step 1.

The interrupt() pattern in agent frameworks pauses the execution loop, serializes state, and waits indefinitely. Common use cases:

  • Approval workflows before database mutations, financial transactions, or external communications
  • Mid-task human review of LLM outputs or tool call arguments before execution
  • Branching points where human judgment is required to proceed

For tasks that span multiple context windows, a multi-session pattern works well: an initializer agent runs once to set up the environment and write a progress file. Continuation agents read git logs and the progress file, orient themselves, and resume from the highest-priority incomplete task. The filesystem provides continuity across sessions — each new agent instance arrives like an engineer starting a shift, reading handoff notes left by the previous one.
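A sketch of the handoff-file idea, assuming a simple JSON progress file (the file shape and function names are illustrative, not any framework's API):

```python
import json
from pathlib import Path

def save_checkpoint(path: str, state: dict) -> None:
    """Serialize the full execution state so an interrupted task can resume."""
    Path(path).write_text(json.dumps(state, indent=2))

def resume(path: str) -> dict:
    """A continuation agent reads the handoff notes left by the previous one."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text())
    # First session: the initializer starts from an empty progress shape.
    return {"completed": [], "next_task": None, "decisions": []}
```

Everything that matters for resumption — completed steps, the next task, decisions already made — lives on disk, not in the context window.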

Harness Thickness: The Core Design Tradeoff

Harnesses have a fundamental design tension. Thin harnesses delegate more to the model — simpler code, harder to debug, more dependent on model capability. Thick harnesses codify more behavior explicitly — more reliable, more code to maintain, harder to evolve as models improve.

The surprising finding: Vercel removed 80% of their agent tools and got better results. Fewer, well-chosen tools beat comprehensive tool catalogs. If a human engineer can't immediately say which tool to use for a given task, the agent can't either. Tool proliferation is a harness design failure.

Counter to this, there's a co-evolution dynamic worth understanding. Models get post-trained with specific harnesses in the loop. Changing tool interfaces, output formats, or execution semantics can degrade performance in ways that look like model regressions but are actually distribution shifts — the model learned to use the harness it was trained with. This is why major agent platforms treat harness interfaces as carefully as public APIs.

The Competitive Moat

The model is increasingly commodity. The harness is where durable advantage lives. Manus rewrote their harness five times in six months with the same underlying models; each rewrite improved reliability. LangChain re-architected their Deep Research system four times in a year. The models didn't change. The infrastructure did.

Building a production-ready harness takes months, not days. The execution loop is a solved problem — you can copy it in an afternoon. Context management, error recovery, safety enforcement, cost attribution, pause-and-resume workflows: these take sustained engineering investment to get right.

The 100-line agent that benchmarks at 74% on SWE-bench is impressive precisely because it proves the loop is not the bottleneck. The gap between 74% and 80% is not a better loop. It's better infrastructure around the loop.

If you're building agents that need to work in production, that's where to focus.
