AI Agent Architecture: What Actually Works in Production

· 11 min read
Tian Pan
Software Engineer

One company shipped 7,949 AI agents. Fifteen percent of them worked. The rest failed silently, looped endlessly, or contradicted themselves mid-task. This is not a fringe result — enterprise analyses consistently find that 88% of AI agent projects never reach production, and 95% of generative AI pilots fail or severely underperform. The gap between a compelling demo and a reliable system is not a model problem. It is an architecture problem.

The engineers who are shipping agents that actually work have converged on a set of structural decisions that look nothing like the toy examples in framework tutorials. This post is about those decisions: where the layers are, where failures concentrate, and why the hardest problems are not about prompts.

The Five Layers Every Production Agent Needs

Regardless of framework or domain, every agent system that survives contact with production can be decomposed into five concerns:

Perception is what reaches the model. It includes document parsing, multimodal preprocessing, and — critically — filtering. Most production systems apply relevance scoring and deduplication before anything enters the reasoning loop. What you exclude matters as much as what you include.
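A minimal sketch of that filtering stage, assuming a pluggable `score_fn` relevance scorer (embedding similarity, a cross-encoder, or a cheap classifier; the threshold is a tuning knob, not a canonical value):

```python
import hashlib

def filter_inputs(chunks, score_fn, threshold=0.5):
    """Deduplicate and relevance-filter input chunks before they
    reach the reasoning loop."""
    seen = set()
    kept = []
    for chunk in chunks:
        # Exact-duplicate filter on normalized content.
        digest = hashlib.sha256(chunk.strip().lower().encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Relevance gate: only chunks above threshold enter context.
        if score_fn(chunk) >= threshold:
            kept.append(chunk)
    return kept
```

Production systems typically use near-duplicate detection (e.g., MinHash) rather than exact hashing, but the shape of the pipeline is the same: dedupe first, then gate on relevance.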

Reasoning is the LLM core. The two dominant patterns are ReAct (iterative reasoning and acting, one step at a time) and Plan-and-Execute (upfront decomposition, then execution). ReAct handles uncertainty better but burns more tokens and makes per-request cost unpredictable. Plan-and-Execute is faster and cheaper for stable, well-scoped tasks but fails when the initial plan is wrong. Most production teams use both: plan-and-execute for routine workflows, ReAct for exploratory or exception paths.
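The hybrid approach can be as simple as a routing policy in front of the reasoning loop. This sketch assumes a hypothetical `KNOWN_WORKFLOWS` registry of well-scoped task types; the names are illustrative:

```python
from enum import Enum

class Pattern(Enum):
    PLAN_AND_EXECUTE = "plan_and_execute"
    REACT = "react"

# Hypothetical registry of routine, well-scoped workflows.
KNOWN_WORKFLOWS = {"invoice_processing", "report_generation"}

def choose_pattern(task_type: str, is_ambiguous: bool) -> Pattern:
    """Route stable, known tasks to upfront planning; route
    exploratory or ambiguous tasks to the iterative ReAct loop."""
    if task_type in KNOWN_WORKFLOWS and not is_ambiguous:
        return Pattern.PLAN_AND_EXECUTE
    return Pattern.REACT
```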

Memory is where most production failures actually live (more on this below).

Tool execution is the integration layer — calling APIs, querying databases, writing files. The surface area here is where security failures, cost overruns, and cascading errors originate.

Orchestration ties everything together with state management, checkpointing, and control flow. The industry has moved decisively toward explicit graph-based state machines rather than open-ended agent loops.

Memory Is the System

The most common production failure mode is not a model failure. It is a memory failure. Agents repeat data fetches already completed in the same session. They contradict decisions made two tool calls ago. They behave differently in session 1,000 than in session 1 because nothing persists between runs. These bugs are harder to reproduce and harder to fix than model errors because they are architectural.

Production teams have converged on four distinct memory tiers, each with different implementation requirements:

Working memory is what lives in context right now — the current task, recent tool results, active constraints. The hard limit is roughly 8,000 tokens; exceeding it produces measurable 40–60% latency increases and degraded instruction-following. Every token beyond that is taxed twice: in inference cost and in quality loss. Keeping working memory tight is not optimization — it is a correctness requirement.
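Enforcing that budget is mechanical: trim from the oldest end, keep the system message, count with your actual tokenizer. A sketch, with `count_tokens` standing in for the tokenizer and the budget as a parameter:

```python
def trim_working_memory(messages, count_tokens, budget=8000):
    """Keep the system message plus the newest messages that fit
    within the token budget. `messages` is oldest-first."""
    system, rest = messages[0], messages[1:]
    kept, used = [], count_tokens(system)
    for msg in reversed(rest):            # walk newest-first
        cost = count_tokens(msg)
        if used + cost > budget:
            break                         # oldest messages fall off
        kept.append(msg)
        used += cost
    return [system] + kept[::-1]          # restore chronological order
```

In practice the trimmed messages are summarized into episodic memory rather than discarded outright.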

Episodic memory stores structured records of past interactions: timestamped JSON events that agents can query. The retrieval pattern that works in practice combines semantic search with recency weighting, because a tool call from five minutes ago is almost always more relevant than a semantically similar call from three days ago. Pure embedding search misses this.
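One way to blend the two signals is a weighted score with exponential recency decay. The weight and half-life below are assumptions to be tuned, not canonical values:

```python
import math
import time

def episodic_score(similarity, event_ts, now=None,
                   half_life_s=3600.0, alpha=0.7):
    """Blend semantic similarity with recency: alpha weights
    similarity vs. recency; half_life_s controls how fast old
    events fade (here, relevance halves every hour)."""
    now = time.time() if now is None else now
    age = max(0.0, now - event_ts)
    recency = math.exp(-math.log(2) * age / half_life_s)
    return alpha * similarity + (1 - alpha) * recency
```

With these parameters, a moderately similar event from five minutes ago outranks a highly similar event from three days ago, which is exactly the behavior pure embedding search misses.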

Semantic memory holds relationship knowledge — the kind of facts that do not fit in a session but need to be reasoned over relationally. Knowledge graphs outperform flat embedding stores here because relational queries (who approved what, which vendor handles which contract type) have structure that vector similarity cannot express. Semantic memory requires scheduled consolidation jobs to resolve contradictions as the knowledge base grows.

Procedural memory stores successful workflow traces as reusable templates. Teams that implement this report 30–50% reductions in planning errors on routine tasks, because the agent is not re-deriving a workflow it has already solved.

The practical takeaway: if your agent has no explicit memory architecture, you have working memory only. That means every session starts from zero, every repeated task burns full inference cost, and every failure leaves no trace for future runs to avoid.

Orchestration vs. Routing: A Critical Distinction

Most teams start with routing — rules-based dispatch that sends billing questions to the billing agent and support requests to the support agent. Routing works, and it is the right starting point. But it breaks down when tasks require state that persists across multiple steps, when failures need to resume rather than restart, and when human approval needs to interrupt a running workflow.

Orchestration is the stateful version: graph-based workflow management with nodes for tool calls and LLM invocations, edges for permissible transitions, and checkpoints that allow pause, resume, and replay. The industry term for this has settled on "flow engineering" — treating agent behavior as a graph traversal problem rather than a prompt problem.

The trade-off is real: graph-based orchestration requires distributed systems knowledge, state persistence strategy, and explicit debugging infrastructure that simple routers do not. Teams without this background frequently underestimate the implementation cost. But the alternative — an open-ended agent loop with no explicit state machine — produces systems that are impossible to debug in production and nearly impossible to make reliable.
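Stripped of framework machinery, the core of graph-based orchestration is small: nodes are functions over a state object, edges define permissible transitions, and every step is checkpointed. This is a deliberately minimal sketch (linear edges, in-memory checkpoints); real systems branch on state and persist checkpoints durably:

```python
import json

def run_graph(nodes, edges, state, start, checkpoints):
    """Run a state-machine workflow: execute each node, checkpoint
    the resulting state, then follow the edge to the next node."""
    current = start
    while current is not None:
        state = nodes[current](state)
        # Serialize state at every step so a failed run can resume
        # from the last checkpoint instead of restarting.
        checkpoints.append((current, json.dumps(state)))
        current = edges.get(current)   # None terminates the workflow
    return state
```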

For multi-agent systems, a three-level hierarchy adds a minimum of 6 seconds of coordination overhead before any worker starts (roughly 2 seconds per LLM call per level). Multi-agent architectures are only worth building when the decomposition benefits — parallelization, domain specialization, failure isolation — exceed that coordination cost.

Tool Calling in Practice

The Model Context Protocol (MCP) has become the de facto standard for tool integration, reducing integration complexity from N×M implementations to N+M. But a research analysis of production MCP tool descriptions found that 89.8% have unstated limitations, 89.3% have missing usage guidelines, and 84.3% have opaque parameters. The standard is sound; the ecosystem documentation is not.

What this means practically: do not trust tool descriptions at face value. Teams that have shipped reliable tool-calling agents enforce explicit maximum tool call counts, validate available tools strictly before execution, and treat tool errors as expected events with retry logic — not exceptional failures.

The other critical pattern: keep tool call results outside the main conversation context. Accumulating tool results in the conversation thread is the fastest way to blow past context limits in long-running workflows. Summarize results and store them in episodic memory instead.

A failure mode worth naming explicitly: the polling anti-pattern. Agents that continuously poll for updates using request-response APIs consume 95% of their API calls wastefully, because the infrastructure model does not match the agent's event-driven requirements. If your agent needs to wait for something to happen, design an event notification path — do not poll.

Context Engineering Replaced Prompt Engineering

The framing that dominated 2023 and 2024 was prompt engineering: how you communicate with the model. The framing that dominates production in 2025 is context engineering: what information the model has at each step.

A production context pipeline has five dynamically assembled layers at inference time:

  1. System instructions
  2. Retrieved knowledge (RAG output, filtered for relevance)
  3. Persistent memory (relevant episodic and semantic records)
  4. Conversation history (compressed)
  5. Tool definitions

The order is not arbitrary. Models pay disproportionate attention to the beginning and end of context windows. System instructions and the most task-relevant context belong at the boundaries. Filler content in the middle — verbose tool schemas, irrelevant history — degrades the model's ability to follow the actual instructions.
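The assembly step itself is simple string composition; the discipline is in the ordering. A sketch that reserves the high-attention boundary slots for the system instructions and the live task:

```python
def assemble_context(system, retrieved, memory, history, tools, task):
    """Assemble the five context layers at inference time, keeping
    the window boundaries for instructions and the live request."""
    parts = [
        system,                  # start boundary: system instructions
        "\n".join(retrieved),    # RAG output, already relevance-filtered
        "\n".join(memory),       # relevant episodic/semantic records
        history,                 # compressed conversation history
        tools,                   # tool definitions
        task,                    # end boundary: the current request
    ]
    return "\n\n".join(p for p in parts if p)
```

The internal ordering of the middle layers is a design choice; what matters is that verbose, low-relevance material never occupies the boundaries.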

Organizations that have built explicit context pipelines with dynamic assembly at runtime report 50% improvements in response latency and 40% higher output quality compared to static prompt templates. The difference is not model capability — it is information architecture.

The Seven Failure Modes

An analysis of failed enterprise agent deployments found that seven patterns account for 94% of production failures:

Scope creep (34%) is the most common. A "bounded automation" project — automate this specific workflow — expands mid-build into an open-ended reasoning system with undefined boundaries. The fix is a written scope contract before architecture begins, with explicit exclusions.

Data quality (27%) is second. Agents do not tolerate incomplete or stale data the way humans do — they multiply errors across tool call chains. An agent querying a database with 15% stale records does not produce 15% wrong answers; it produces subtly wrong answers throughout, with high confidence.

Security blockers (14%) are discovered late. Access controls that seem reasonable in a human workflow become architectural problems when an agent needs to act autonomously across systems. The fix is involving security review before system design is complete, not at gate review.

The remaining four — integration complexity (9%), cost overruns (7%), governance gaps (5%), organizational resistance (4%) — are each significant but more tractable with standard engineering practices.

The failure mode that deserves special attention is what practitioners call "dumb RAG": dumping entire data repositories into vector databases and retrieving top-k results without filtering for relevance or contradiction. The result is that context windows fill with conflicting information, and the model produces high-confidence hallucinations — because it has evidence for both the correct and incorrect answer. The fix is layered retrieval with explicit relevance thresholds, not just embedding similarity.

Reliability Numbers to Design For

A 5% per-step failure rate on a 20-step agent workflow produces a roughly 36% end-to-end success rate (0.95^20 ≈ 0.36), or equivalently a 64% chance that at least one step fails. That agent is unusable without human oversight on every run. The production target for autonomous agents is below 1% failure per step. That requires explicit error handling, retry logic with backoff, graceful degradation when tools fail, and checkpointing so failures can resume rather than restart.
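The arithmetic is worth internalizing, because it also shows why per-step retries are so effective: one retry turns a 5% step failure into 0.25%, which compounds very differently. A small calculator, assuming independent failures:

```python
def end_to_end_success(per_step_failure, steps, retries=0):
    """P(workflow succeeds) when each step fails independently with
    probability per_step_failure and gets `retries` extra attempts."""
    p_step = 1 - per_step_failure ** (retries + 1)
    return p_step ** steps
```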

Latency targets by agent type have become reasonably standardized:

  • Simple query agents: under 1,000ms P95
  • Complex workflow agents: under 4 seconds P95
  • Multi-agent orchestration: under 6 seconds P95
  • Voice agents: under 800ms to first audio byte P95

Voice agents are the hardest case. Typical P99 latency for voice agent systems sits at 8–15 seconds — that 1% of outliers represents thousands of bad experiences daily at scale.

Agent performance has a consistency problem that benchmarks miss: agents that achieve 60% success on single-run evaluations often drop to 25% when measured across 8 runs for the same task. Consistency at scale is not the same as accuracy on a test set.
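The gap between single-run and repeated-run numbers is partly just probability. If runs were fully independent, a 60% per-run agent would succeed on all 8 runs only about 1.7% of the time; an observed 25% implies the failures are correlated at the task level (some tasks always pass, others always fail). A one-line model of the independent case:

```python
def consistency_at_k(per_run_success, k):
    """P(success on all k runs) under the independence assumption."""
    return per_run_success ** k
```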

Human-in-the-Loop as Architecture

The right framing for human oversight is not "add a confirmation step." It is a tiered risk model built into the architecture from the start:

  • Tier 1 (low-stakes, reversible): autonomous action
  • Tier 2 (medium-stakes, reversible): take action and notify
  • Tier 3 (financial, legal, or irreversible): require approval before acting

The tiers need to be defined explicitly for each tool and workflow, not inferred from general intuition. An agent sending a Slack message is Tier 1. An agent sending a customer-facing email is Tier 2. An agent modifying a production database record is Tier 3. These boundaries need to be in the state machine, not in the prompt.
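Putting the tiers in the state machine means the gate runs as code, before the tool does. A sketch with a hypothetical per-tool tier map (the tool names mirror the examples above and are illustrative):

```python
from enum import IntEnum

class Tier(IntEnum):
    AUTONOMOUS = 1        # low-stakes, reversible
    ACT_AND_NOTIFY = 2    # medium-stakes, reversible
    REQUIRE_APPROVAL = 3  # financial, legal, or irreversible

# Hypothetical tier map; lives in orchestrator config, not the prompt.
TOOL_TIERS = {
    "send_slack_message": Tier.AUTONOMOUS,
    "send_customer_email": Tier.ACT_AND_NOTIFY,
    "update_prod_record": Tier.REQUIRE_APPROVAL,
}

def execute(tool, run, notify, request_approval):
    """Gate tool execution by tier before the tool ever runs."""
    tier = TOOL_TIERS[tool]
    if tier is Tier.REQUIRE_APPROVAL and not request_approval(tool):
        return "blocked: awaiting approval"
    result = run(tool)
    if tier is Tier.ACT_AND_NOTIFY:
        notify(tool)              # act first, then inform a human
    return result
```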

Observability is the other non-negotiable. Production agent systems need to capture agent spans, tool call sequences, token counts, and safety filter outcomes in a format that supports debugging — not just logging. The OpenTelemetry Generative AI semantic conventions have become the standard. If you cannot reconstruct exactly what an agent did and why in a failed session, you cannot fix the failure.

What This Means for How You Build

The market for AI agents is real and growing fast — enterprise adoption has crossed 57% of organizations with agents in production. The failure rates are also real. The gap is not in model capability; it has been narrowing rapidly. The gap is in surrounding systems: state management, data infrastructure, integration architecture, security design, and observability.

The engineers shipping reliable agents have stopped treating agent development as a prompting problem and started treating it as a distributed systems problem. The same principles apply: explicit state contracts between components, typed interfaces, failure modes designed in advance, observability from day one, and incremental scope with clear boundaries.

The hard part is not the demo. The demo has been easy since 2023. The hard part is the loop — making it reliable when the same task runs 10,000 times, with real data, in an environment where tools fail and inputs are messy. That problem is solved with architecture, not with better prompts.
