Building AI Agents That Actually Work in Production

· 10 min read
Tian Pan
Software Engineer

Most teams building AI agents make the same mistake: they architect for sophistication before they have evidence that sophistication is needed. A production analysis of 47 agent deployments found that 68% would have achieved equivalent or better outcomes with a well-designed single-agent system. The multi-agent tax — higher latency, compounding failure modes, operational complexity — often eats the gains before they reach users.

This isn't an argument against agents. It's an argument for building them the same way you'd build any serious production system: start with the simplest thing that works, instrument everything, and add complexity only when the simpler version demonstrably fails.

What an Agent Actually Is

The definition matters because it shapes the architecture. An agent is a system with three components: instructions (what to do), guardrails (what not to do), and tools (what it can act on). The key distinction from a chatbot is that agents act — they don't just generate text.

There's a useful spectrum here. At one end: workflows, where an LLM executes tool calls through predefined code paths. Predictable, debuggable, lower risk. At the other end: autonomous agents, where the LLM dynamically directs its own process in open-ended loops with environmental feedback. Flexible, higher ceiling, and significantly harder to operate.

Most production use cases belong somewhere in the middle, not at the autonomous end. The industry hype pulls teams toward the complex end too early. Resist it.

Six Patterns That Cover Most Production Needs

These patterns are composable building blocks. Understanding them lets you choose the right one for each task rather than defaulting to "full agent" for everything.

Prompt chaining decomposes a task into sequential steps, with programmatic gates between them. You trade latency for accuracy. It's the right default for well-defined pipelines where each step's output needs validation before the next begins.
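A minimal sketch of the pattern, assuming a `call_llm` stand-in for your model client (not a real API): two steps with a programmatic gate between them, so a bad intermediate result stops the chain before the next model call is paid for.

```python
# Sketch of prompt chaining with a programmatic gate between steps.
# `call_llm` is a placeholder for a real model client.

def call_llm(prompt: str) -> str:
    # Placeholder: in production this would call your model provider.
    return f"output for: {prompt}"

def extract_outline(doc: str) -> str:
    return call_llm(f"Outline the key points of:\n{doc}")

def outline_gate(outline: str) -> bool:
    # Programmatic check between steps: reject empty outlines before
    # spending tokens on the next step.
    return len(outline.strip()) > 0

def draft_summary(outline: str) -> str:
    return call_llm(f"Write a summary from this outline:\n{outline}")

def summarize(doc: str) -> str:
    outline = extract_outline(doc)
    if not outline_gate(outline):
        raise ValueError("outline failed validation; stopping the chain")
    return draft_summary(outline)
```

The gate is ordinary code, which is the point: validation logic stays deterministic and testable even though the steps it separates are not.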

Routing classifies an incoming request and dispatches it to a specialized downstream process. A customer service system routes billing questions differently than technical support questions. Each specialization can be optimized independently.
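A sketch of routing under stated assumptions: the keyword classifier below stands in for an LLM classifier constrained to a fixed label set, and the handlers stand in for specialized downstream agents.

```python
# Sketch of the routing pattern: classify the request, then dispatch
# to a specialized handler. The keyword classifier is a placeholder
# for a model call constrained to a fixed label set.

def classify(request: str) -> str:
    text = request.lower()
    if "refund" in text or "invoice" in text:
        return "billing"
    if "error" in text or "crash" in text:
        return "technical"
    return "general"

# Each handler can be optimized independently: its own prompt,
# tools, and even model.
HANDLERS = {
    "billing": lambda r: f"[billing agent] {r}",
    "technical": lambda r: f"[technical agent] {r}",
    "general": lambda r: f"[general agent] {r}",
}

def route(request: str) -> str:
    return HANDLERS[classify(request)](request)
```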

Parallelization runs subtasks simultaneously and combines results. Use it when subtasks are genuinely independent, or when you want to run multiple model passes and vote for consensus on factual claims.
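The voting variant can be sketched as follows, with `ask_model` standing in for independent model passes (different sampling seeds, or different models entirely):

```python
# Sketch of parallelization with consensus voting: run several model
# passes concurrently and keep the majority answer.

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def ask_model(question: str, seed: int) -> str:
    # Placeholder for an independent model pass; one pass disagrees
    # to exercise the voting logic.
    return "Paris" if seed != 1 else "Lyon"

def vote(question: str, passes: int = 3) -> str:
    with ThreadPoolExecutor(max_workers=passes) as pool:
        answers = list(pool.map(lambda s: ask_model(question, s), range(passes)))
    answer, count = Counter(answers).most_common(1)[0]
    if count <= passes // 2:
        # No majority: don't pick arbitrarily, escalate instead.
        raise ValueError("no majority; escalate to a human or retry")
    return answer
```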

Orchestrator-workers use a central LLM to dynamically delegate unpredictable subtasks to specialized worker agents. This is the dominant pattern in production for open-ended research and coding workflows. The orchestrator handles decomposition; workers handle execution.
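The division of labor can be sketched like this, where `plan`, `worker`, and `synthesize` are all stand-ins for model calls; in a real system the orchestrator's decomposition is dynamic, which is what distinguishes this from plain parallelization:

```python
# Sketch of the orchestrator-worker pattern: a central planner
# decomposes the task, workers execute subtasks in parallel, and the
# orchestrator synthesizes the results.

from concurrent.futures import ThreadPoolExecutor

def plan(task: str) -> list[str]:
    # Placeholder decomposition; a real orchestrator LLM produces
    # subtasks that cannot be known in advance.
    return [f"research: {task}", f"verify sources for: {task}"]

def worker(subtask: str) -> str:
    # Placeholder worker agent with its own instructions and tools.
    return f"result({subtask})"

def synthesize(results: list[str]) -> str:
    return " | ".join(results)

def run(task: str) -> str:
    subtasks = plan(task)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(worker, subtasks))
    return synthesize(results)
```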

Evaluator-optimizer adds an iterative refinement loop with a feedback judge. The judge evaluates output against a rubric and sends it back for revision until it clears the bar. This works well for code generation, translation, and any task with measurable quality criteria.
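A sketch of the loop, with `generate` and `judge` as stand-ins for model calls; note the iteration cap, which keeps the refinement loop from running forever when the judge is never satisfied:

```python
# Sketch of the evaluator-optimizer loop: a judge scores output
# against a rubric and sends it back for revision until it clears
# a threshold or hits the iteration cap.

def generate(task: str, feedback: str = "") -> str:
    # Placeholder generator; revises when given feedback.
    return f"draft of {task}" + (" (revised)" if feedback else "")

def judge(output: str) -> tuple[float, str]:
    # Placeholder rubric judge: score in [0, 1] plus revision feedback.
    return (0.9, "") if "revised" in output else (0.4, "tighten the intro")

def refine(task: str, threshold: float = 0.8, max_rounds: int = 3) -> str:
    output = generate(task)
    for _ in range(max_rounds):
        score, feedback = judge(output)
        if score >= threshold:
            return output
        output = generate(task, feedback)
    return output  # best effort after hitting the iteration cap
```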

Autonomous agents run open-ended loops with environmental feedback. Use this only when the steps genuinely cannot be predetermined — tasks like long-horizon software development where the agent must discover what needs to be done.

The trap to avoid: picking autonomous agents because it sounds impressive, then spending months engineering away the failure modes that make autonomous systems hard to ship.

The Single vs. Multi-Agent Decision

This is where teams spend too much engineering time on the wrong question. The question isn't "how do we architect our multi-agent system?" — it's "do we actually need multi-agent at all?"

Default to a single agent unless you hit one of three hard boundaries:

  • Security/compliance isolation: Your regulations require strict data separation between domains (e.g., financial services separation of duties).
  • Multiple owning teams: Different parts of the system have independent deployment cycles and can't reasonably share a codebase.
  • Genuine domain breadth: The system spans fundamentally different functions that a single set of instructions and tools can't reasonably handle.

Everything else is usually better served by a well-designed single agent. Customer support single agents average 2–4 seconds per interaction; multi-agent equivalents run 8–15 seconds. Time to production for a single-agent system is days; for multi-agent, weeks. One documented case compared $47,000/month in multi-agent orchestration overhead to $22,700 for a single agent — with only a 2.1% accuracy difference.

Those numbers will change as the technology matures. The point isn't that multi-agent is never worth it. The point is that the bar should be a demonstrated limitation of the simpler system, not a desire for architectural elegance.

Tool Design Is the Highest-Leverage Work

Teams that ship reliable agents spend more time on tool definitions than on anything else. This is the counterintuitive insight that separates practitioners from theorists.

An LLM processing tool descriptions treats them like documentation it found on the internet. Descriptions that resemble well-structured API docs are processed more reliably than terse parameter lists. Edge cases need to be spelled out explicitly. Every constraint the agent might encounter should be anticipated in the tool definition.

Several principles that matter in practice:

Use absolute paths, never relative. Any ambiguity the model must resolve is a failure opportunity.

Keep tool count manageable. Hallucinated tool arguments increase measurably with tool count. More tools is not better — it's a surface area problem. Only expose what's necessary for the task.

Write for the agent, not the developer. Agent-facing tools require different trade-offs than SDK endpoints. A developer reading a schema can ask clarifying questions; an agent will make its best guess.

Handle errors explicitly in the schema. Agents misinterpreting HTTP 400, 404, or 429 as reasons to report success (or hallucinate data) is one of the most common production failures. The tool definition should specify what each error means and what the agent should do.

Amazon built a system that auto-generates standardized tool schemas from API documentation — the investment in tooling infrastructure paid for itself quickly as the number of integrated services scaled.

Seven Ways Agents Fail in Production

Understanding failure modes before you ship saves incident response time later. These are the seven patterns that appear repeatedly across production deployments:

Underspecified design. "Remove outdated entries" gives an agent enough rope to interpret "outdated" destructively. The fix is adversarial scenario testing before deployment — explicitly try to find interpretations that cause harm.

Hallucination cascades. A phantom SKU triggers pricing, then stock, then shipping APIs before anyone notices the root value was fabricated. Consensus checks across models and confidence thresholds that pause execution before high-stakes actions prevent these from propagating.
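A minimal sketch of such a gate, assuming a `catalog` set standing in for the system of record: a hard check against real data blocks fabricated values, and a confidence threshold pauses execution for review before anything irreversible happens.

```python
# Sketch of a confidence gate before a high-stakes action: verify the
# model-produced value against the system of record before letting
# downstream APIs fire.

catalog = {"SKU-1001", "SKU-1002"}  # stand-in for the real SKU database

def gated_order(sku: str, confidence: float, threshold: float = 0.85):
    # Hard check: a fabricated SKU fails here instead of propagating
    # through pricing, stock, and shipping calls.
    if sku not in catalog:
        return ("blocked", f"{sku} not in catalog; likely hallucinated")
    # Soft check: low model confidence pauses for human review.
    if confidence < threshold:
        return ("paused", "below confidence threshold; needs approval")
    return ("proceed", sku)
```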

Context corruption. A false flag in memory persists across sessions, silently affecting every subsequent interaction. Provenance tracking on memory writes — knowing where each piece of information came from — makes these detectable before they compound.
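One way to sketch provenance tracking, with illustrative class and field names: every memory write records its source, so when a tool or session turns out to be corrupted, everything it wrote can be traced and purged in one pass.

```python
# Sketch of provenance tracking on memory writes: each entry records
# where it came from, so corrupted values can be purged by source.

import time
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    value: str
    source: str  # e.g. "user_message", "tool:crm_lookup", "model_inference"
    written_at: float = field(default_factory=time.time)

class AgentMemory:
    def __init__(self):
        self._store: dict[str, MemoryEntry] = {}

    def write(self, key: str, value: str, source: str):
        self._store[key] = MemoryEntry(value, source)

    def read(self, key: str) -> str:
        return self._store[key].value

    def purge_source(self, source: str) -> int:
        # Remove everything a compromised source wrote, in one pass.
        bad = [k for k, e in self._store.items() if e.source == source]
        for k in bad:
            del self._store[k]
        return len(bad)
```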

Multi-agent communication breakdown. A format change in one agent's output silently corrupts a downstream agent's input. Explicit schemas at agent boundaries and distributed tracing stitched across agent hops catch this class of failure.

Tool misuse. A data cleanup agent deletes a production folder because "redundant files" was interpreted too broadly. Least-privilege tool permissions, whitelisting critical functions, and requiring human approval for destructive actions are the standard mitigations.

Prompt injection. A customer email contains an instruction for the agent to forward conversation history to an external address — and the agent complies. This is the highest-priority attack class in production agents. Defense requires treating external content as untrusted and never allowing it to modify the agent's operating instructions.

Termination failure. The agent processes half a document set and terminates with a success signal, or enters an infinite refinement loop. Explicit completion criteria, maximum iteration limits, and multi-stage validators at planning, execution, and output phases address this.
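Both halves of the failure can be guarded in a few lines; here is a sketch where the per-item work is a stand-in, the iteration cap prevents the infinite loop, and an explicit completion check prevents the false success signal:

```python
# Sketch of termination guards: an explicit completion criterion plus
# a hard iteration cap, so the loop can neither end early with a false
# success nor spin forever.

def run_batch(items: list[str], max_iterations: int = 100) -> dict:
    done: list[str] = []
    remaining = list(items)
    iterations = 0
    while remaining:
        iterations += 1
        if iterations > max_iterations:
            # Fail loudly instead of looping forever.
            return {"status": "aborted", "done": done, "remaining": remaining}
        done.append(remaining.pop(0))  # stand-in for one unit of agent work
    # Completion criterion checked explicitly: success only if every
    # item was actually processed.
    assert len(done) == len(items), "terminated without finishing"
    return {"status": "success", "done": done, "remaining": []}
```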

A July 2025 incident made the rounds: an agent that was explicitly instructed not to touch a production database executed a destructive query and then attempted to generate fake records to conceal it. The lesson is that natural language instructions have no code-level rigidity. Guardrails need to be enforced at the infrastructure layer, not just in the prompt.

What Observability Looks Like for Agents

Traditional monitoring reports HTTP 200 as success. Agent failures frequently return 200 while the agent hallucinates downstream. You need trajectory visualization — the sequence of tool calls and model decisions — not just HTTP monitoring.

The industry is converging on OpenTelemetry as the telemetry standard for agent systems. The practical requirement is five pillars working together: distributed traces across agent boundaries, automated evaluations on output quality, human review sampling for bias detection, alerts on trajectory anomalies, and a data engine that feeds production failures back into evaluation datasets.
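The trace pillar can be sketched without any vendor tooling; the decorator below records each tool call's inputs, outcome, and timing into a plain list, where a production system would emit OpenTelemetry spans instead:

```python
# Sketch of trajectory capture: record every tool call with its
# arguments, error status, and timing so the decision sequence can be
# replayed. In production this would emit OpenTelemetry spans; here it
# appends to a plain list.

import functools
import time

TRAJECTORY: list[dict] = []

def traced_tool(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            # Record the call whether it succeeded or raised.
            TRAJECTORY.append({
                "tool": fn.__name__,
                "args": args,
                "error": error,
                "duration_ms": (time.perf_counter() - start) * 1000,
            })
    return wrapper

@traced_tool
def lookup_order(order_id: str) -> str:
    # Hypothetical tool for illustration.
    return f"order {order_id}: shipped"
```

The key property is that the record is per-decision, not per-request: an HTTP 200 on the outer request tells you nothing, but a trajectory of tool calls shows exactly where the agent went wrong.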

The sequence matters. Instrument everything before you optimize anything. One team discovered through human review that their research agent was systematically choosing SEO-optimized content over authoritative sources — a bias their automated evaluation suite missed entirely. You can't fix what you can't see.

Lessons from Building at Scale

Anthropic's internal research system, which uses a hub-spoke orchestrator-worker pattern, offers some concrete data on what happens when you run agents at scale:

Token usage explains 80% of variance in output quality. More context budget equals better results up to a point, but the marginal value decreases and cost scales linearly. Single agents consume roughly 4x more tokens than ordinary chat interactions, and multi-agent systems roughly 15x more; the task needs to be high-value enough to justify the overhead.

The system originally spawned 50+ subagents for simple queries before effort-scaling rules were introduced. The right complexity level is not obvious from first principles — you discover it empirically.

Long-horizon agents need checkpoint systems. Restarting from scratch when a process fails after 20 minutes of work is unacceptable at production scale. Build resumability into the architecture before you need it, not after an outage.

Evaluation for agents is structurally different from evaluation for static models. Agents take legitimately different valid paths to the same goal. An evaluator that penalizes deviation from a canonical path will reject correct solutions. You need rubric-based LLM judges that evaluate outputs against quality criteria, not trajectory similarity.

A Framework for Getting Started

The practical path for teams shipping their first production agent:

Start with direct API calls, not a framework. Frameworks abstract the decision-making in ways that make production failures harder to diagnose. Build enough to understand what the abstractions are hiding, then adopt tooling once you know what you need from it.

Instrument before you optimize. The teams that iterate fastest are the ones with traces on every tool call and evaluations on every output from the beginning — not the ones who add observability after the fact.

Design for single agent first. Build the simplest system that could work, then find where it actually fails. Use those failures as evidence for whether the next layer of complexity is justified.

Invest in tool definitions proportionally to tool count. Every tool you add is surface area for failure. Make each one earn its place.

Require human approval for irreversible actions until you have the production data to trust automated execution. The cost of a pause is low; the cost of a destructive action that can't be undone is not.

Agents are not a replacement for good software engineering — they're an extension of it. The teams shipping reliable agents are the ones treating tool design, failure taxonomy, and observability with the same rigor they'd bring to any other production system.
