Harness Engineering: The Discipline That Determines Whether Your AI Agents Actually Work
Most teams running AI coding agents are optimizing the wrong variable. They obsess over model selection — Claude vs. GPT vs. Gemini — while treating the surrounding scaffolding as incidental plumbing. But benchmark data and production war stories point elsewhere: the gap between a model that impresses in a demo and one that ships production code reliably comes almost entirely from the harness around it, not the model itself.
The formula is deceptively simple: Agent = Model + Harness. The harness is everything else — tool schemas, permission models, context lifecycle management, feedback loops, sandboxing, documentation infrastructure, architectural invariants. Get the harness wrong and even a frontier model produces hallucinated file paths, breaks its own conventions twenty turns into a session, and declares a feature done before writing a single test.
The clearest evidence for harness primacy comes from SWE-bench, the standard benchmark for coding agents. The same model scores dramatically differently depending on the scaffold wrapping it — gaps of 20–30 percentage points between harness implementations on identical underlying models. SWE-bench is not just testing the model; it is simultaneously evaluating the harness. Teams treating model choice as the primary reliability variable are measuring the wrong thing.
Guides and Sensors: The Core Taxonomy
The most useful framework for thinking about harness design distinguishes two fundamentally different kinds of controls.
Guides are feedforward — they steer the agent before it acts. An AGENTS.md file is a guide. Architecture documentation in the repository is a guide. A bootstrapping script that initializes the agent's working context before it takes the first action is a guide. Guides encode what good looks like, prevent bad outputs proactively, and inject project-specific knowledge that doesn't live in the model's weights.
Sensors are feedback — they observe what the agent did and create signals for correction. TypeScript's type checker is a sensor. ESLint is a sensor. A Playwright end-to-end test suite is a sensor. An AI code reviewer that flags semantic issues is a sensor. Sensors are what allow agents to self-correct within a single session rather than requiring human intervention at each step.
Sensors split further by type. Computational sensors — type checkers, formatters, structural linters — are deterministic, run in milliseconds, and provide binary pass/fail feedback. Inferential sensors — AI reviewers that assess whether code actually satisfies the intent — are probabilistic, slower, and catch meaning-level errors that structural tools miss entirely. A mature harness uses both.
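The two sensor types compose naturally into a pipeline that respects their cost asymmetry: run the cheap deterministic checks first, and only invoke the slow probabilistic review once the code is structurally clean. A minimal sketch, with stand-in sensors (the `type_check` and `ai_review` functions here are illustrative stubs, not real tool integrations):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Signal:
    sensor: str
    ok: bool
    detail: str

def run_sensors(diff: str,
                computational: list[Callable[[str], Signal]],
                inferential: list[Callable[[str], Signal]]) -> list[Signal]:
    """Run cheap deterministic sensors first; only pay for the
    slower probabilistic review once the code is structurally clean."""
    signals = [check(diff) for check in computational]
    if all(s.ok for s in signals):
        signals += [review(diff) for review in inferential]
    return signals

# Stand-in sensors for illustration. A real harness would shell out
# to tsc/eslint here and call a reviewer model for the second kind.
def type_check(diff: str) -> Signal:
    return Signal("typecheck", ": any" not in diff, "no escape-hatch types")

def ai_review(diff: str) -> Signal:
    return Signal("ai-review", True, "implementation matches stated intent")
```

Short-circuiting on computational failures also keeps the feedback signal clean: the inferential reviewer never wastes a pass on code that is already known to be structurally broken.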
The practical implication is that most teams have accidental harnesses, not designed ones. They have some TypeScript and maybe a linter. They've added a CLAUDE.md or AGENTS.md file. But they haven't asked: what does the agent need to know before it starts? What feedback loops exist after every action? What happens when those loops detect a failure? The answers to those questions — not the model — determine production reliability.
Your Codebase Is Now a Communication Protocol
Something fundamental changes when agents are doing primary code generation: the repository itself becomes the primary communication channel between human engineers and the agents they delegate to. Every architectural decision that lives in someone's head, in Confluence, in a Notion doc, or in tribal knowledge creates a hole in the harness.
One Codex team built a one-million-line production codebase in five months with three engineers generating roughly 1,500 pull requests — about 3.5 PRs per engineer per day, with no manually written source code. The single most important design decision: optimize the repository for agent legibility first. Not for human readability in the traditional sense, but for what a zero-context agent needs to operate effectively.
Concretely, that means:
- All architectural decisions must live in the repository, machine-readable, not scattered across external documentation systems
- Strict dependency layering enforced by structural tests that run in CI — not documented in a wiki somewhere, mechanically enforced
- Cross-linked documentation validated by linters so that references don't rot when code changes
- An AGENTS.md or CLAUDE.md file that every session loads, containing tech stack, build/test commands, and architectural constraints
The last point deserves emphasis. Research comparing human-curated versus LLM-generated AGENTS.md files found a striking asymmetry: LLM-generated files actively hurt agent performance, while human-curated files yield roughly a four-percentage-point improvement on agent benchmarks. The file that loads into every agent session and shapes every subsequent decision should be carefully maintained by humans, not auto-generated.
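The cross-linking requirement in the list above can be mechanized with a small linter. A sketch, assuming relative markdown links and using a deliberately simplistic regex (a real linter would parse the markdown properly):

```python
import re
from pathlib import Path

# Naive pattern for relative markdown link targets: everything between
# "](" and the closing paren, excluding anchors.
LINK = re.compile(r"\]\(([^)#\s]+)")

def dead_links(doc: Path) -> list[str]:
    """Return link targets in `doc` that no longer resolve to a file,
    so documentation references fail CI instead of silently rotting."""
    return [
        target
        for target in LINK.findall(doc.read_text())
        if not target.startswith("http") and not (doc.parent / target).exists()
    ]
```

Run over every doc in the repository, this turns "references don't rot" from a hope into an invariant the agent gets a signal about.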
Invariants Over Micromanagement
One of the most counterintuitive lessons from teams running agents at scale: trying to control agent behavior through verbose instructions in prompts doesn't scale. Instructions get long, agents start ignoring the ones at the end, and the rules-to-enforcement ratio keeps getting worse.
The more durable approach is enforcing invariants rather than describing preferences. When an architectural boundary violation or a formatting inconsistency fails CI rather than being expressed in a style guide, the agent doesn't need to remember the rule — it gets a signal. This is the same principle that makes type systems more effective than code review comments at enforcing contracts.
One effective pattern: encode architecture as structural fitness functions that run in CI. If you have a strict layering rule — say, that service layer code cannot import from the runtime layer — write a test that enforces it. The agent doesn't need to understand the architectural reasoning; it just needs the test to pass before the PR merges. The invariant is self-enforcing.
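Assuming a directory layout with the service and runtime layers under `src/`, such a fitness function can be a few lines of test code (the paths and the import-matching regex here are illustrative, not a standard):

```python
import re
from pathlib import Path

# Hypothetical layering rule: files under src/service/ must never
# import from src/runtime/. The test IS the invariant.
FORBIDDEN_IMPORT = re.compile(r"""from\s+['"][^'"]*runtime/""")

def layering_violations(root: Path) -> list[str]:
    """Return every service-layer file that reaches into the runtime layer."""
    return [
        str(path)
        for path in (root / "src" / "service").rglob("*.ts")
        if FORBIDDEN_IMPORT.search(path.read_text())
    ]

# Wired into CI as a plain assertion, e.g.:
#   assert layering_violations(repo_root) == [], "service layer imports runtime"
```

The agent never needs to be told the rule in a prompt; any PR that crosses the boundary simply fails.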
This shifts the human engineer's primary job from writing code to writing acceptance criteria. Specifying what "done" looks like — in terms that a sensor can evaluate — is more leveraged than generating the implementation. Vague prompts produce hallucinated file paths and modifications to wrong modules. A structured task that specifies which files should change, what the acceptance criteria are, and what tests should pass constrains the solution space enough to make outputs predictable.
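Such a structured task can be expressed as plain data. A sketch of what the shape might look like; the field names are illustrative, not an established schema:

```python
from dataclasses import dataclass

@dataclass
class TaskSpec:
    """A task structured so that 'done' is checkable by a sensor."""
    goal: str
    files_to_change: list[str]        # constrains where edits may land
    acceptance_criteria: list[str]    # each criterion maps to a sensor
    tests_that_must_pass: list[str]

task = TaskSpec(
    goal="Return 429 once a client exceeds its rate limit",
    files_to_change=["src/api/middleware.ts"],
    acceptance_criteria=["101st request within a minute gets HTTP 429"],
    tests_that_must_pass=["e2e/rate-limit.spec.ts"],
)
```

Every field narrows the solution space: the agent knows where to edit, what counts as success, and which sensor will render the verdict.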
The Self-Evaluation Problem
Any harness that asks the generating agent to judge its own output gets confidence without accuracy. This is not a subtle failure mode — it's a systematic bias. Models consistently overestimate the quality of code they just produced, and the overconfidence compounds as sessions get longer and the agent has invested more in a particular approach.
The architectural response is separating the generator from the evaluator. Anthropic's harness for long-running agents does this explicitly: a planner expands prompts into detailed specs, a generator implements features, and a separate evaluator tests against defined criteria — using Playwright for browser interactions, grading against design quality, originality, craft, and functionality. The evaluator and generator also negotiate sprint contracts before implementation begins, agreeing on what "done" looks like for a chunk of work before any code is written.
The generator-evaluator split has a second benefit: it makes the evaluation criteria explicit and inspectable. When the evaluator is a separate agent with a defined rubric, humans can review and adjust what's being measured. When it's the same agent self-assessing, the rubric is invisible and invariably optimistic.
Context Windows Are Stateless Shift Workers
The most common category of long-running agent failure isn't wrong code — it's context management failure. After enough turns, agents forget earlier constraints. They re-implement functions they wrote twenty steps ago. They lose the thread of what the current feature is supposed to accomplish. They start confidently doing things that contradict decisions made at the start of the session.
Context rot is architectural, not accidental. As token count grows, accuracy degrades — the attention mechanism's effective recall for early context decreases, and instructions given in the system prompt start competing with a growing body of observations and tool outputs. Every component of an agent harness is, in some sense, a response to this constraint.
Effective harnesses address context management at multiple layers:
- Initialization scripts that restore project state at the start of every session, regardless of what happened in previous sessions
- Persistent progress files (often in JSON rather than Markdown, because models are less likely to edit structured data inappropriately) that carry task state across context boundaries
- Shadow git snapshots on every file change, enabling per-step rollback when a long chain of actions needs to be partially unwound
- Event-driven reminders injected at decision points rather than front-loaded in long system prompts — this counters instruction fade-out without inflating context further
- Context resets with structured handoffs for very long-running tasks, where the agent's accumulated observations are summarized and a fresh context begins
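The shadow-snapshot idea from the list above can be sketched without git at all: record the workspace before each agent action, and restore any recorded step on demand. Real harnesses commit to a shadow git repo on every file change; plain directory copies keep this example dependency-free:

```python
import shutil
import tempfile
from pathlib import Path

class ShadowSnapshots:
    """Per-step workspace snapshots so a long chain of agent actions
    can be partially unwound. A sketch, not production-grade."""

    def __init__(self, workspace: Path):
        self.workspace = workspace
        self.store = Path(tempfile.mkdtemp(prefix="shadow-"))
        self.count = 0

    def take(self) -> int:
        """Record the workspace before the agent's next action."""
        step = self.count
        shutil.copytree(self.workspace, self.store / f"step-{step}")
        self.count += 1
        return step

    def rollback(self, step: int) -> None:
        """Unwind the workspace to how it looked at `step`."""
        shutil.rmtree(self.workspace)
        shutil.copytree(self.store / f"step-{step}", self.workspace)
```

The payoff is granularity: instead of discarding an entire session when step 40 of 50 goes wrong, the harness rewinds to step 39.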
One team running agents with 200+ feature items stored the feature list as JSON — not Markdown — specifically because the model was less likely to make inappropriate edits to structured data formats than to prose. That's the granularity of harness engineering in practice.
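A minimal sketch of such a progress file, assuming a JSON layout of the harness designer's choosing (the keys here are illustrative):

```python
import json
from pathlib import Path

def load_progress(path: Path) -> dict:
    """Restore task state at session start, whatever happened last time.
    A missing file just means a fresh start."""
    if path.exists():
        return json.loads(path.read_text())
    return {"features": [], "decisions": []}

def mark_done(path: Path, feature: str) -> None:
    """Append a completed feature and persist immediately, so state
    survives a context reset or a crashed session."""
    state = load_progress(path)
    state["features"].append({"name": feature, "status": "done"})
    path.write_text(json.dumps(state, indent=2))
```

Because the file is structured data rather than prose, the agent treats it as something to update through defined operations, not a document to rewrite.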
Sandboxing Is Infrastructure, Not Security Theater
At scale, execution sandboxing is non-negotiable. Research suggests 40–62% of AI-generated code contains security vulnerabilities when tested against established criteria. But the primary reason for sandboxing isn't catching malicious output — it's enabling velocity. Agents that can run code and observe effects in isolated environments iterate faster and more accurately than agents relying on static analysis alone.
The infrastructure pattern that's emerging: ephemeral environments provisioned in under a second, scoped to only the modified workload rather than a full cluster, auto-destroyed on task completion or failure. Human reviewers get a sandbox link to interact with running code immediately — no local setup, no environment drift between review and production.
E2E test access matters more than unit test access for agent accuracy. An agent with Playwright or Puppeteer can verify behavior "as a human user would" — a much stronger signal than unit tests, which are easy to write in ways that pass while the feature is broken. This isn't about agent capability; it's about feedback loop quality. Better sensors produce more accurate agents.
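The distinction is between asserting on internals and exercising the service over the wire. A dependency-free stand-in for that kind of sensor, using a throwaway HTTP server in place of the real application (a real harness would drive a browser via Playwright or Puppeteer instead):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy service standing in for the application under test.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *_):  # keep sensor output quiet
        pass

def e2e_health_check(port: int) -> bool:
    """Pass only if the running service answers a real HTTP request
    correctly -- behavior as a user would observe it, not internals."""
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/health") as resp:
        return resp.status == 200 and resp.read() == b"ok"

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
healthy = e2e_health_check(server.server_port)
server.shutdown()
```

A unit test of `Handler` in isolation could pass while routing, serialization, or middleware is broken; the over-the-wire check cannot.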
What Harness Engineering Actually Changes
The teams doing this well have made several practical shifts:
They've stopped treating AGENTS.md as documentation and started treating it as engineering infrastructure — versioned, reviewed, tested against actual agent performance. They maintain it the same way they maintain other CI configuration.
They've moved quality checks left, distributing them across pre-commit (fast linters), PR integration (type checking, architecture fitness functions), and continuous monitoring (drift detection). The traditional CI pipeline is too slow for agent-generated code at velocity.
They've accepted that every component of their harness encodes a hypothesis about what the current model can't do reliably on its own — and they revisit those hypotheses as models improve. Scaffolding that was necessary six months ago may now add overhead without value.
Most importantly, they've stopped asking "which model is best" and started asking "what does the harness need to reliably produce the outputs we require?" The model is a component. The harness is the product.
The teams treating agent deployment as a model selection problem are going to keep being surprised by production failures that happen three hours into a session, after the model has drifted away from the constraints specified at session start. The teams building harnesses are shipping.
- https://martinfowler.com/articles/harness-engineering.html
- https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
- https://arxiv.org/html/2603.05344v1
- https://developers.redhat.com/articles/2026/04/07/harness-engineering-structured-workflows-ai-assisted-development
- https://www.infoq.com/news/2026/02/openai-harness-engineering-codex/
- https://www.infoq.com/news/2026/04/anthropic-three-agent-harness-ai/
- https://newsletter.eng-leadership.com/p/how-openais-codex-team-works-and
- https://new.signadot.com/blog/your-infrastructure-isnt-ready-for-agentic-development-at-scale
- https://arxiv.org/html/2511.09268v1
- https://www.latent.space/p/anita-tdd
- https://www.swebench.com/
- https://github.com/microsoft/agent-governance-toolkit
