Agent Engineering Is a Discipline, Not a Vibe
Most agent systems fail in production not because the underlying model is incapable. They fail because the engineering around the model is improvised. The model makes a wrong turn at step three and nobody notices until step eight, when the final answer is confidently wrong and there are no guardrails to catch it. This is not a model problem. It is an architecture problem.
Agent engineering has gone through at least two full hype cycles in three years. AutoGPT and BabyAGI generated enormous excitement in spring 2023, then crashed against the reality of GPT-4's unreliable tool use. A second wave arrived with multi-agent frameworks and agentic RAG in 2024. Now, in 2026, more than half of surveyed engineering teams report having agents running in production — and most of them have also discovered that deploying an agent and maintaining a reliable agent are different problems. The teams that are succeeding are treating agent engineering as a structured discipline. The teams that are struggling are still treating it as a vibe.
The Field Cannot Agree on a Definition — and That Is a Signal
Ask ten practitioners to define "AI agent" and you will get ten different answers. OpenAI describes agents in terms of a runtime model plus instructions plus tools. LangChain's framing centers on the degree to which an LLM controls the execution flow. Research papers decompose agents into perception, brain, planning, action, and collaboration layers.
The disagreement is not merely academic. When your team cannot agree on what an agent is, it cannot agree on what properties it should have, what can fail, and what needs to be engineered. This definitional ambiguity is one reason production deployments regularly miss critical design decisions.
A useful working framework covers six dimensions:
- Intent: How the agent interprets goals — including multimodal inputs, task decomposition, and success criteria
- Memory: How the agent maintains coherence across turns and sessions — in-context state, external storage, and skill libraries
- Planning: How the agent sequences actions — from linear chains to tree search to self-revision loops
- Control flow: Who or what decides what happens next — the LLM, deterministic code, or a hybrid
- Authority: What the agent is allowed to do and under what conditions — trust boundaries, confirmation gates, rollback capabilities
- Tools: What external capabilities the agent can invoke — APIs, search, code execution, browsers
The interesting observation is that most agent failures trace back to exactly one of these dimensions being underspecified. An agent that loops endlessly has a planning problem. An agent that performs irreversible actions without checkpoints has an authority problem. An agent that drifts after twenty turns has a memory problem.
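One way to make the six dimensions operational is to require an explicit design artifact that answers each of them before any prompt is written. A minimal sketch, assuming nothing beyond the framework above — all field names and the example values are illustrative, not from any particular framework:

```python
from dataclasses import dataclass

@dataclass
class AgentSpec:
    """Explicit answers to the six design dimensions, written before any prompt."""
    intent: str            # how goals are interpreted and decomposed
    memory: str            # working, episodic, and semantic layers and their interfaces
    planning: str          # linear chain, tree search, or revision loop
    control_flow: str      # which transitions are LLM-decided vs. deterministic
    authority: list[str]   # capabilities the agent may use, and under what gates
    tools: list[str]       # external capabilities the agent can invoke

# Hypothetical example for a billing-support agent.
spec = AgentSpec(
    intent="resolve billing disputes raised in user email",
    memory="working: summarized session; episodic: per-user history; semantic: policy KB",
    planning="linear plan with a replan trigger on tool error",
    control_flow="LLM picks the next tool; refunds route through a deterministic gate",
    authority=["read_invoices", "draft_refund (requires human confirmation)"],
    tools=["invoice_api", "refund_api", "email_search"],
)

# An empty or vague field is the review signal: it marks the dimension
# most likely to be the underspecified one that fails in production.
assert all(getattr(spec, name) for name in spec.__dataclass_fields__)
```

The value is not the dataclass itself but the review it forces: each production failure mode in the paragraph above maps to exactly one field being left vague.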
Control Flow Is the Highest-Leverage Design Decision
The defining characteristic of a truly agentic system is that the LLM influences the control flow. In a traditional software system, control flow is explicit: if this, then that. In an agent, the LLM reads context and decides what action to take next. The more agentic the system, the more of those branching decisions live inside the model.
This is powerful and dangerous in equal measure.
The power is flexibility. An agent can handle novel situations that were never explicitly anticipated during design. It can recover from partial failures, adapt its plan based on intermediate results, and pursue a goal through paths the developer never mapped.
The danger is that control flow bugs in agentic systems are invisible until they compound. In conventional code, a branch taken incorrectly produces a wrong result. In an agent, a wrong decision at turn three corrupts the context that informs turns four through ten. The agent does not crash. It proceeds, confidently, on a foundation of stale or incorrect intermediate state.
The engineering response is flow engineering: treating the topology of agent decisions as a first-class design artifact rather than an emergent property. This means making explicit which transitions are LLM-controlled and which are deterministic. It means identifying decision points where an incorrect branch causes irreversible downstream effects and adding hard gates at those points. It means drawing the state machine before writing the prompts.
Teams that are succeeding in production tend to apply a simple test at each LLM-controlled decision point: what happens downstream if this step returns a plausible-but-wrong answer? If the answer is "the agent quietly proceeds to a bad outcome," that decision point needs either a deterministic fallback, a human confirmation gate, or a confidence threshold that triggers a retry.
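The test can be implemented as a small wrapper around each LLM-controlled decision point. This is a sketch under stated assumptions — the confidence score, the stub decision function, and the names are all hypothetical, and a real system would get the score from a verifier model or a calibrated heuristic:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Decision:
    action: str
    confidence: float  # assumed to come from a verifier or calibrated heuristic, in [0, 1]

def guarded_decide(
    decide: Callable[[], Decision],
    threshold: float,
    fallback: Optional[Callable[[], Decision]] = None,
    confirm: Optional[Callable[[Decision], bool]] = None,
    max_retries: int = 2,
) -> Decision:
    """Apply the plausible-but-wrong test at one LLM-controlled decision point."""
    for _ in range(max_retries + 1):
        d = decide()
        if d.confidence >= threshold:
            # High-consequence actions can still require a human confirmation gate.
            if confirm is None or confirm(d):
                return d
    # Low confidence after retries: prefer a deterministic fallback to
    # quietly proceeding on a plausible-but-wrong answer.
    if fallback is not None:
        return fallback()
    raise RuntimeError("no confident decision and no deterministic fallback")

# Usage sketch: a stub standing in for an LLM call that never clears the bar.
result = guarded_decide(
    decide=lambda: Decision(action="refund_invoice_123", confidence=0.55),
    threshold=0.8,
    fallback=lambda: Decision(action="escalate_to_human", confidence=1.0),
)
print(result.action)  # escalate_to_human
```

The design point is the `raise` at the end: a decision point with no confident answer and no fallback should fail loudly, not proceed.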
Memory Is Underspecified in Almost Every Production System
Memory is the component that most teams treat as an afterthought and that most production systems get wrong. In-context memory works fine for short sessions. For anything involving sessions longer than a few thousand tokens, multi-session continuity, or shared state across concurrent agents, in-context memory is insufficient.
The pattern that appears in most production failures is this: the agent was designed for a short session and worked well in testing. When deployed, users have longer sessions or return for follow-up interactions. The agent loses coherence. Earlier established facts get truncated out of context. The agent contradicts itself or asks for information it already has.
Production memory architecture needs at least three layers:
- Working memory — in-context state for the current session, managed through summarization and attention to what is actually needed at each step
- Episodic memory — compressed summaries of past sessions, structured enough to be retrieved accurately, not so verbose that retrieval is noisy
- Semantic memory — persistent facts, user preferences, and domain knowledge, stored externally and retrieved with enough structure to avoid the retrieval failures that plague naive RAG implementations
The hardest part is not building any one of these layers in isolation. It is managing the interface between them and deciding what information moves between layers and when. An agent that aggressively compresses episodic memory will lose detail. An agent that retrieves too liberally will dilute working memory with irrelevant past context. The right balance is specific to the application and requires empirical tuning, not defaults.
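The three layers and the interfaces between them can be sketched in a few dozen lines. This is a minimal illustration, not a production store — the `summarize` and `retrieve` callables stand in for real summarization and retrieval components, and the retrieval cap is a placeholder for the tuning the paragraph above describes:

```python
from collections import deque

class LayeredMemory:
    """Three-layer memory sketch: working (in-context), episodic, semantic.
    The promotion rules between layers are the design decision, not the storage."""

    def __init__(self, working_limit: int = 6):
        self.working: deque = deque(maxlen=working_limit)  # current-session state
        self.episodic: list = []        # compressed summaries of past sessions
        self.semantic: dict = {}        # persistent facts and user preferences

    def observe(self, event: str) -> None:
        self.working.append(event)      # oldest events silently truncate at the limit

    def end_session(self, summarize) -> None:
        # Compress the session *before* it is truncated out of context.
        self.episodic.append(summarize(list(self.working)))
        self.working.clear()

    def remember_fact(self, key: str, value: str) -> None:
        self.semantic[key] = value

    def build_context(self, query: str, retrieve) -> list:
        # Retrieve sparingly: too-liberal retrieval dilutes working memory
        # with irrelevant past context. The cap of 2 is an arbitrary placeholder.
        return retrieve(self.episodic, query)[:2] + list(self.working)

# Usage sketch with stub summarization and retrieval.
mem = LayeredMemory()
mem.remember_fact("user_timezone", "UTC+2")
mem.observe("user asked to reschedule the review meeting")
mem.end_session(summarize=lambda events: "; ".join(events))
ctx = mem.build_context("meeting", retrieve=lambda eps, q: [e for e in eps if "meeting" in e])
print(ctx)  # ['user asked to reschedule the review meeting']
```

Notice where the hard problems live: in `end_session` (how lossy is the summary?) and `build_context` (how much past context is too much?), not in any individual store.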
Planning Failure Modes Teams Consistently Underestimate
Most agent frameworks offer a planning loop: generate a plan, execute steps, observe results, revise the plan. This sounds robust. In practice it introduces failure modes that teams consistently underestimate.
Premature convergence. An agent that generates a plan and then commits to it will execute a suboptimal plan when intermediate observations suggest a better path. Planning loops need explicit revision triggers — conditions under which the agent stops and replans rather than continuing to execute.
Infinite loops. An agent stuck on a subtask will retry it indefinitely if there is no timeout or maximum iteration count. Timeouts are not a sign that your agent is broken. They are a required component of any production planning loop.
Irreversibility blindness. Planning agents do not have an inherent sense of which actions are reversible and which are not. An agent that schedules a meeting, sends an email, and then realizes it misunderstood the user's intent cannot unsend the email. Planning needs to distinguish between reversible and irreversible operations and require higher-confidence thresholds before executing irreversible actions.
Compounding approximation. Multi-step plans accumulate small errors. Each step's output becomes the next step's input, and approximation errors compound. A plan that is 95% accurate at each step is roughly 60% accurate at ten steps if errors are independent — and they often are not independent, because errors in early steps create systematic biases in later steps.
The engineering response to these failure modes is not to avoid planning loops — they are genuinely powerful for complex tasks. The response is to treat the planning loop as a state machine with explicit transitions, and to make those transitions testable.
Authority Is the Reliability Problem That Feels Like a Policy Problem
Authority — what the agent is allowed to do, and under what conditions — is frequently treated as a policy decision made by product managers, not an engineering decision made by builders. This is a mistake.
The technical implementation of authority has direct reliability implications. An agent with no authority constraints will happily execute destructive operations when it misunderstands a goal. An agent with overly rigid authority constraints will fail on legitimate tasks and create user-visible errors. The right implementation is neither; it is a layered authority model with explicit confirmation gates at high-consequence operations.
The concept of minimum necessary authority from security engineering applies here. An agent performing a research task should not have write access to production databases, even if write access is technically available. An agent scheduling meetings should not be able to cancel existing meetings without explicit confirmation. Constraining authority in advance is far cheaper than debugging a production incident caused by an agent that had authority it should not have used.
Authority also has a time dimension. Many production agents need to perform long-running tasks — sequences of operations that span minutes or hours. These agents need checkpoints: points at which they can be paused, inspected, and resumed or redirected without losing all previous work. A long-running agent without checkpoints is a liability: if it goes wrong at step twenty, the only recovery option is to restart from the beginning.
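The checkpoint requirement can be sketched minimally: persist a resume point after each step commits, and on restart skip everything already done. This toy version checkpoints only a step index to a local file; a real agent would also persist working memory and intermediate results:

```python
import json
import os
import tempfile

def run_with_checkpoints(steps, path):
    """Execute steps in order, checkpointing after each so a restart resumes."""
    start = 0
    if os.path.exists(path):                 # resume from the last checkpoint
        with open(path) as f:
            start = json.load(f)["next_step"]
    for i in range(start, len(steps)):
        steps[i]()                           # do the work for this step
        with open(path, "w") as f:           # checkpoint *after* the step commits
            json.dump({"next_step": i + 1}, f)

# Usage sketch: the second run finds the checkpoint and repeats no work.
log = []
def step(i):
    return lambda: log.append(i)

path = os.path.join(tempfile.mkdtemp(), "agent.ckpt")
run_with_checkpoints([step(0), step(1), step(2)], path)
run_with_checkpoints([step(0), step(1), step(2)], path)  # resumes past the end
print(log)  # [0, 1, 2]
```

Checkpointing after the step commits (not before) matters: a checkpoint written before a step that then fails would cause the failed step to be silently skipped on resume.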
Debugging Agents Is Not Debugging Code
When a traditional program fails, you read the stack trace. When an agent fails, there is often no stack trace. The agent completed successfully from its own perspective — it just produced a wrong answer, took a wrong action, or entered a loop that eventually hit a timeout.
Debugging agentic systems requires observability infrastructure that most teams do not have at the time they first encounter production failures. The minimum viable observability stack for a production agent includes:
- Step-level tracing — a record of every LLM call, every tool invocation, and every state transition, with inputs and outputs at each step
- Statistical monitoring — baseline metrics for session success rate, step count per session, and tool error rate, with alerts calibrated to statistical deviations rather than hard thresholds
- Replay capability — the ability to re-execute a specific session from a checkpoint using a fixed seed, to reproduce failures deterministically for debugging
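Step-level tracing, the first layer of the stack above, is often just a decorator applied to every LLM call and tool function. A minimal sketch, with an in-memory trace standing in for durable storage:

```python
import functools
import time

TRACE = []  # in production this would stream to durable storage, not a list

def traced(step_type: str):
    """Record inputs, outputs, and timing for every decorated call."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            t0 = time.time()
            out = fn(*args, **kwargs)
            TRACE.append({
                "type": step_type,
                "name": fn.__name__,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": out,
                "ms": round((time.time() - t0) * 1000, 1),
            })
            return out
        return inner
    return wrap

@traced("tool")
def search(query: str) -> str:
    return f"results for {query}"   # stub tool for illustration

search("q3 revenue")
print(TRACE[0]["name"])  # search
```

Recording outputs as well as inputs is what makes the trace usable for the replay capability: the trace becomes a complete record of every non-deterministic result the session depended on.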
The non-determinism of LLMs makes the replay capability the hardest to build and the most valuable to have. Without it, a failure observed once cannot be reliably reproduced, and fixing it without reproduction is guesswork.
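One caveat worth noting: hosted model APIs do not always honor a fixed seed, so a common alternative approach (an assumption here, not something every stack does) is record-and-substitute replay — capture every model output during the live run, then feed the recording back in place of the model. The `agent` and the lambda model below are stubs:

```python
class Recorder:
    """Wraps a model callable and logs every output during the live run."""
    def __init__(self, llm):
        self.llm, self.log = llm, []
    def __call__(self, prompt: str) -> str:
        out = self.llm(prompt)
        self.log.append(out)
        return out

class Replayer:
    """Serves recorded outputs in order: same outputs, same order, every time."""
    def __init__(self, log):
        self.log, self.i = log, 0
    def __call__(self, prompt: str) -> str:
        out = self.log[self.i]
        self.i += 1
        return out

def agent(llm) -> str:
    # Stub two-step agent: a planning call whose output feeds an execution call.
    plan = llm("plan the task")
    return llm(f"execute: {plan}")

live = Recorder(lambda p: f"model answer to [{p}]")  # stand-in for a real model
original = agent(live)
replayed = agent(Replayer(live.log))
print(original == replayed)  # True
```

Because the replayed run sees byte-identical model outputs, any divergence in agent behavior during replay is by construction a bug in the surrounding code, not in the model — which is exactly the isolation debugging needs.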
Among teams with agents in production, approximately 94% report having some form of observability in place, but fewer than 52% report running systematic offline evaluations on test sets. The implication is that most teams know they need to observe what their agents do, but fewer have operationalized the practice of evaluating whether what the agents do is correct.
What the Teams Getting It Right Have in Common
Reliable agent systems in production tend to share a set of structural properties that have little to do with model selection and everything to do with engineering decisions made before writing the first prompt.
They treat the agent's control flow as an artifact that needs to be designed explicitly, not a property that emerges from the model's intelligence. They have a clear answer to the question: at every decision point, what is the failure mode if this step returns a plausible-but-wrong answer?
They build memory architecture for the actual usage pattern, not the testing pattern. This means understanding that production sessions are longer and more varied than test sessions, and designing memory layers accordingly.
They apply minimum necessary authority by default and require explicit justification for each capability the agent has that is not strictly required for its task.
They have observability infrastructure in place before the agent goes to production, not after the first incident.
And they treat agent engineering as a distinct skill that integrates product thinking, data infrastructure, and reliability engineering — not as software engineering with a different syntax. The teams that are struggling in 2026 are largely still treating it as the latter.
