46 posts tagged with "agent-architecture"

The Abandon Primitive: Why Your Agent Loop Needs a First-Class Way to Quit a Plan

· 11 min read
Tian Pan
Software Engineer

Look at the loop primitives most agent frameworks ship: continue, return, retry, and a step budget that hard-stops the run. Notice what is missing. There is a path that says "the work succeeded," a path that says "the model wants to keep going," and a path that says "we ran out of money or patience and shot the loop in the head." There is no first-class path that says "the plan I am executing is wrong, and I want to throw it away and start a different one." The abandon primitive — an explicit, structured way for the planner to declare its current trajectory hopeless — is the missing verb in the agent loop's grammar, and its absence is responsible for a category of failures that are usually misdiagnosed as "the model needs more reasoning."

A planner three steps into a doomed branch keeps refining the same wrong plan because the loop's only exits are succeed, retry the last step, or hit the budget. None of those are "give up on the strategy and try a different one." So the agent does what the loop allows: it edits its plan in place, calls one more tool, asks for one more clarification, and burns through its step budget converging on a non-solution. When the wall finally hits, the user sees a polite failure message that is not an answer to their question. The cost of those wasted steps is real — production data suggests 5–10% of token spend on agent systems goes into retries that produce nothing usable, and that figure is dominated by long doomed branches, not isolated tool errors.
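One way to picture the missing verb is a loop whose step results can carry an explicit abandon outcome alongside continue, return, and retry. The following is a minimal sketch, not any framework's API: `Outcome`, `StepResult`, and `run_loop` are hypothetical names, and each plan is modeled as a plain function of the step index.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, List, Optional, Tuple


class Outcome(Enum):
    CONTINUE = auto()  # the plan wants another step
    RETURN = auto()    # the work succeeded
    RETRY = auto()     # rerun the same step
    ABANDON = auto()   # hypothetical: the current plan is hopeless


@dataclass
class StepResult:
    outcome: Outcome
    value: Optional[str] = None
    reason: Optional[str] = None


def run_loop(plans: List[Callable[[int], StepResult]], step_budget: int) -> dict:
    """Try plans in order; ABANDON moves to the next plan instead of burning budget."""
    steps_used = 0
    abandoned: List[Tuple[int, Optional[str]]] = []
    for plan_index, plan in enumerate(plans):
        step = 0
        while steps_used < step_budget:
            result = plan(step)
            steps_used += 1
            if result.outcome is Outcome.RETURN:
                return {"status": "ok", "value": result.value,
                        "steps": steps_used, "abandoned": abandoned}
            if result.outcome is Outcome.ABANDON:
                abandoned.append((plan_index, result.reason))
                break  # discard this plan, keep the remaining budget
            if result.outcome is Outcome.CONTINUE:
                step += 1
            # RETRY falls through and reruns the same step index
        else:
            # The while condition failed: the step budget is exhausted.
            return {"status": "budget_exhausted", "value": None,
                    "steps": steps_used, "abandoned": abandoned}
    return {"status": "no_plans_left", "value": None,
            "steps": steps_used, "abandoned": abandoned}


def doomed(step: int) -> StepResult:
    # Two steps of progress, then the planner declares the branch hopeless.
    if step < 2:
        return StepResult(Outcome.CONTINUE)
    return StepResult(Outcome.ABANDON, reason="wrong strategy")


def direct(step: int) -> StepResult:
    return StepResult(Outcome.RETURN, value="done")


result = run_loop([doomed, direct], step_budget=10)
```

In this sketch the doomed plan costs three steps rather than the whole budget, and the abandon reason is recorded so the next plan (or a human reading the trace) can see why the first strategy was dropped.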

First-Touch Tool Burn: Why Your Agent Reads Twelve Files Before Doing What You Asked

· 11 min read
Tian Pan
Software Engineer

Your agent just spent ninety seconds and a few dollars to change a three-line function. Before the edit landed, it listed two directories, opened the test file, ran a grep for callers, read the config module, checked the CI workflow, and pulled up a type definition it never used. The diff it produced was four lines. The trace that produced it was forty-three tool calls.

This is first-touch tool burn: the pattern where an agent, handed a well-scoped task, behaves as if every request is a research problem. The exploration happens first and it happens hard — sixty to eighty percent of the token budget spent on listing, grepping, and reading before a single character is written to a file. Teams discover this the first time they look at a trace and realize the agent did the equivalent of a two-hour onboarding for a two-minute task.

The behavior isn't a bug in any specific model. It's the predictable output of how these systems were trained and evaluated, colliding with a production environment that measures something training never did: whether the work was cheap enough to bother doing at all.
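The exploration share described above can be measured directly from a trace. The sketch below assumes a trace is a list of `(tool_name, tokens)` pairs and that tools can be split into exploration and mutation sets; the tool names and the `exploration_share` helper are hypothetical, not taken from any particular agent runtime.

```python
from typing import Sequence, Tuple

# Assumed tool taxonomy; real agents will use different names.
EXPLORATION_TOOLS = frozenset({"list_dir", "grep", "read_file"})
MUTATION_TOOLS = frozenset({"edit_file", "write_file"})


def exploration_share(trace: Sequence[Tuple[str, int]]) -> float:
    """Fraction of total tokens spent exploring before the first mutation."""
    total = sum(tokens for _, tokens in trace)
    if total == 0:
        return 0.0
    before_first_write = 0
    for tool, tokens in trace:
        if tool in MUTATION_TOOLS:
            break  # first character written; stop counting
        if tool in EXPLORATION_TOOLS:
            before_first_write += tokens
    return before_first_write / total


trace = [
    ("list_dir", 100),
    ("grep", 200),
    ("read_file", 400),
    ("edit_file", 300),
]
share = exploration_share(trace)
```

Tracked per task type, a number like this makes it possible to tell a genuinely exploratory task from a well-scoped one that was treated as research.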

Plan-and-Execute Is Marketing, Not Contract: Plan Adherence as a First-Class SLI

· 9 min read
Tian Pan
Software Engineer

The agent printed a five-step plan. Step three said "fetch the user's billing history from the invoices service." The trace shows step three actually called the orders service, joined a stale customer table, and produced a number that looked right. The output passed the eval. The post-mortem found the regression six weeks later, when finance noticed the dashboard had quietly diverged from source-of-truth by 4%.

Nobody wrote a bug. The planner wrote a contract the executor never signed.

This is the failure mode plan-and-execute architectures bury under their own architectural elegance. The pattern was sold as a way to give agents long-horizon coherence: a strong model drafts a plan, weaker models execute steps, the plan acts as a scaffold. In practice the plan is a marketing artifact — a plausible-looking story emitted at t=0, then promptly invalidated by every interesting thing that happens at t>0. The trace shows the plan. The trace shows the actions. Almost nobody is measuring the distance between them.
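A crude version of "the distance between them" is the fraction of planned steps whose executed tool matches the planned tool. This is a minimal sketch under the assumption that both the plan and the trace are recorded in structured form and that executed steps are tagged with the plan step they claim to fulfil; `PlannedStep`, `ExecutedStep`, and `plan_adherence` are illustrative names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PlannedStep:
    index: int
    tool: str


@dataclass(frozen=True)
class ExecutedStep:
    plan_index: int  # which plan step this action claims to fulfil
    tool: str


def plan_adherence(plan: List[PlannedStep], trace: List[ExecutedStep]) -> float:
    """Share of planned steps executed with the tool the plan named."""
    if not plan:
        return 1.0
    executed = {step.plan_index: step.tool for step in trace}
    matched = sum(1 for step in plan if executed.get(step.index) == step.tool)
    return matched / len(plan)


plan = [
    PlannedStep(1, "auth_service"),
    PlannedStep(2, "customer_service"),
    PlannedStep(3, "invoices_service"),
    PlannedStep(4, "aggregate"),
    PlannedStep(5, "render_report"),
]
trace = [
    ExecutedStep(1, "auth_service"),
    ExecutedStep(2, "customer_service"),
    ExecutedStep(3, "orders_service"),  # diverges from the plan
    ExecutedStep(4, "aggregate"),
    ExecutedStep(5, "render_report"),
]
score = plan_adherence(plan, trace)
```

A tool-name match is deliberately coarse: it would catch the orders-versus-invoices swap above, but not a step that calls the right service with the wrong arguments.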

Conversation State Is Not a Chat Array: Multi-Turn Session Design for Production

· 10 min read
Tian Pan
Software Engineer

Most multi-turn LLM applications store conversation history as an array of messages. It works fine in demos. It breaks in production in ways that take days to diagnose because the failures look like model problems, not infrastructure problems.

A user disconnects mid-conversation and reconnects to a different server instance—session gone. An agent reaches turn 47 in a complex task and the payload quietly exceeds the context window—no error, just wrong answers. A product manager asks "can we let users try a different approach from step 3?"—and the engineering answer is "no, not with how we built this." These are not edge cases. They are the predictable consequences of treating conversation state as a transient array rather than a first-class resource.
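To make the contrast with a flat array concrete, here is a minimal sketch of a session stored as a tree of turns that can be serialized (so any server instance can load it) and forked from an earlier turn. The `Turn` and `Session` classes and their methods are hypothetical, and persistence is reduced to a JSON string for brevity.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    parent: Optional[int]  # index of the previous turn on this branch
    role: str
    content: str


@dataclass
class Session:
    session_id: str
    turns: List[Turn] = field(default_factory=list)
    head: Optional[int] = None  # index of the latest turn on the active branch

    def append(self, role: str, content: str) -> int:
        self.turns.append(Turn(parent=self.head, role=role, content=content))
        self.head = len(self.turns) - 1
        return self.head

    def fork_from(self, turn_id: int) -> None:
        """Continue the conversation from an earlier turn; old turns are kept."""
        if not 0 <= turn_id < len(self.turns):
            raise IndexError(f"unknown turn {turn_id}")
        self.head = turn_id

    def history(self) -> List[Turn]:
        """Turns on the active branch, oldest first."""
        path: List[Turn] = []
        cursor = self.head
        while cursor is not None:
            path.append(self.turns[cursor])
            cursor = self.turns[cursor].parent
        return list(reversed(path))

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "Session":
        data = json.loads(raw)
        return cls(
            session_id=data["session_id"],
            turns=[Turn(**turn) for turn in data["turns"]],
            head=data["head"],
        )


session = Session("demo")
session.append("user", "plan a trip")
step_one = session.append("assistant", "step 1")
session.append("user", "step 2")
session.append("assistant", "step 3")
session.fork_from(step_one)
session.append("user", "try another approach")

restored = Session.from_json(session.to_json())
branch = [turn.content for turn in restored.history()]
```

Because every turn records its parent, "try a different approach from step 3" becomes a pointer move rather than a rebuild, and the payload sent to the model is whatever `history()` returns for the active branch.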

Defining Escalation Criteria That Actually Work in Human-AI Teams

· 10 min read
Tian Pan
Software Engineer

Most AI teams can tell you their containment rate — the percentage of interactions the AI handled without routing to a human. Far fewer can tell you whether that number is the right one.

Escalation criteria are the single most important design document in an AI-augmented team, and most teams don't have one. They have a threshold buried in a YAML file and an implicit assumption that the AI knows when it's stuck. That assumption is wrong in both directions: too high a threshold and humans spend their days redoing AI work; too low and users absorb AI errors without recourse. Both failures are invisible until they compound.

The Share-Nothing Agent: Designing AI Agents for Horizontal Scalability

· 12 min read
Tian Pan
Software Engineer

Your load balancer assigns an incoming agent request to replica 3. But the user's conversation history lives in memory on replica 7. Replica 3 has no idea what has happened in the last six turns, so it starts over, confuses the user, and your on-call engineer gets paged at 2 AM. You add sticky sessions. Now all requests for that user route to replica 7 forever. You've traded a correctness bug for a scalability ceiling.

This is the moment teams realize that "horizontal scaling" for AI agents is not the same problem as horizontal scaling for web servers. The fixes are different, and the naive paths fail in predictable ways.

Temperature Governance in Multi-Agent Systems: Why Variance Is a First-Class Budget

· 11 min read
Tian Pan
Software Engineer

Most production multi-agent systems apply a single temperature value—copied from a tutorial, set once, never revisited—to every agent in the pipeline. The classifier, the generator, the verifier, and the formatter all run at 0.7 because that's what the README said. This is the equivalent of giving every database query the same timeout regardless of whether it's a point lookup or a full table scan. It feels fine until you start debugging failure modes that look like model errors but are actually sampling policy errors.

Temperature is not a global dial. It's a per-role policy decision, and getting it wrong creates distinct failure signatures depending on which direction you miss in.
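A per-role policy can be as simple as a lookup table that refuses to fall back to a shared default. The sketch below is illustrative only: the temperature values are assumptions chosen to show the shape of the idea, not recommendations, and `SamplingPolicy` and `policy_for` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SamplingPolicy:
    temperature: float
    rationale: str


# Assumed values for illustration; tune per role against your own evals.
POLICIES: Dict[str, SamplingPolicy] = {
    "classifier": SamplingPolicy(0.0, "labels should be reproducible"),
    "generator": SamplingPolicy(0.8, "variety is part of the job"),
    "verifier": SamplingPolicy(0.0, "same input should get the same verdict"),
    "formatter": SamplingPolicy(0.0, "output must match a fixed schema"),
}


def policy_for(role: str) -> SamplingPolicy:
    """Look up a role's policy; an unknown role is an error, not a silent 0.7."""
    try:
        return POLICIES[role]
    except KeyError:
        raise ValueError(f"no sampling policy defined for role {role!r}") from None


classifier_temperature = policy_for("classifier").temperature
```

Failing loudly on an unknown role is the design choice that matters here: it forces whoever adds a new agent to the pipeline to make the sampling decision explicitly.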

When Workflow Engines Beat LLM Agents: A Decision Framework for Deterministic Orchestration

· 9 min read
Tian Pan
Software Engineer

Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027 — primarily due to escalating costs, unclear business value, and inadequate risk controls. Industry surveys put the production success rate for autonomous AI agents somewhere between 5% and 11%. Those numbers suggest something important: for a large fraction of the tasks teams are throwing agents at, a deterministic workflow engine would have done the job faster, cheaper, and more reliably.

This isn't an anti-AI argument. It's an architectural one. The question isn't whether LLMs are capable — it's whether autonomous, open-ended reasoning is the right execution model for the task you're building. For a surprisingly large class of structured business processes, the answer is no.

Ambient AI Architecture: Designing Always-On Agents That Don't Get Disabled

· 9 min read
Tian Pan
Software Engineer

Most teams building ambient AI ship something users immediately turn off.

The pattern is consistent: the team demos the feature internally, everyone agrees it's useful in theory, and within two weeks of launch the disable rate exceeds 60%. This isn't a model quality problem. It's an architecture problem — and specifically an interrupt threshold problem. Teams design their ambient agents around what the AI can do rather than what users will tolerate when they didn't ask for help.

The gap between explicit invocation ("ask the AI") and ambient monitoring ("the AI watches and acts") is not just a UX question. It demands a fundamentally different system architecture, a different event model, and a different mental model for when an AI agent earns the right to speak.

When Your Agent Framework Becomes the Bug

· 8 min read
Tian Pan
Software Engineer

High-level agent frameworks promise to turn a three-day integration into a three-hour prototype. That promise is real. The problem is what happens next: six months into production, engineers at a company that builds AI-powered browser testing agents discovered they were spending as much time debugging LangChain as building features. Their fix was radical — they eliminated the framework entirely and went back to modular building blocks. "Once we removed it," they wrote, "we no longer had to translate our requirements into LangChain-appropriate solutions. We could just code."

They are not alone. Roughly 45% of developers who experiment with high-level LLM orchestration frameworks never deploy them to production. Another 23% eventually remove them after shipping. These numbers don't mean frameworks are bad tools — they mean frameworks are tools with a specific useful range, and that range is narrower than the demos suggest.

Agentic Task Complexity Estimation: Budget Tokens Before You Execute

· 10 min read
Tian Pan
Software Engineer

Two agents receive the same user message. One finishes in 3 seconds and 400 tokens. The other enters a Reflexion loop, burns through 40,000 tokens, hits the context limit mid-task, and produces a half-finished answer. Neither the agent nor the calling system predicted which outcome was coming. This is not an edge case — it is the default behavior when agents start tasks without any model of how deep the work will go.

LLM-based agents have no native sense of task scope before execution. A request that reads as simple in natural language might require a dozen tool calls and multiple planning cycles; a complex-sounding request might resolve in a single lookup. Without pre-execution complexity estimation, agents commit resources blindly: cumulative token spend grows quadratically as turn history accumulates and is resent on every call, planning overhead dominates execution time, and by the time the system detects a problem, the early decisions that caused it are irreversible.

The Warm Handoff Pattern: Designing Fluid Control Transfer Between Agents and Humans

· 12 min read
Tian Pan
Software Engineer

Most agent escalation flows are cold transfers dressed up with good intentions. The agent decides it cannot proceed, drops an "I'm connecting you to a human" message, and routes the session to an operator who has no idea what the agent tried, what failed, or what the user actually needs. The human starts from scratch. The user repeats themselves. Trust erodes, not because the AI was wrong, but because nobody designed the boundary.

The warm handoff pattern is an architectural discipline for the exact moment an agent yields control. It treats that boundary as a first-class system concern rather than an afterthought. Done well, the receiving party — human or agent — steps into a briefed, structured situation. Done poorly, that boundary is where user trust goes to die.
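A briefed, structured handoff can be sketched as a packet that must state what the user needs, what the agent tried, and why it stopped before the transfer is allowed. The `HandoffPacket` shape and the `hand_off` gate below are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    action: str
    outcome: str


@dataclass
class HandoffPacket:
    session_id: str
    user_goal: str
    attempts: List[Attempt]
    blocking_reason: str

    def missing_fields(self) -> List[str]:
        """Names of the fields a receiver would need but that are empty."""
        missing = []
        if not self.user_goal.strip():
            missing.append("user_goal")
        if not self.attempts:
            missing.append("attempts")
        if not self.blocking_reason.strip():
            missing.append("blocking_reason")
        return missing

    def briefing(self) -> str:
        lines = [
            f"Goal: {self.user_goal}",
            f"Blocked on: {self.blocking_reason}",
            "Tried:",
        ]
        lines += [f"- {a.action}: {a.outcome}" for a in self.attempts]
        return "\n".join(lines)


def hand_off(packet: HandoffPacket) -> str:
    """Refuse a cold transfer: the packet must be complete before routing."""
    missing = packet.missing_fields()
    if missing:
        raise ValueError(f"cold handoff refused, missing: {', '.join(missing)}")
    return packet.briefing()


packet = HandoffPacket(
    session_id="s-1",
    user_goal="refund a duplicate charge",
    attempts=[Attempt("lookup_order", "found two identical charges")],
    blocking_reason="refund exceeds the agent's approval limit",
)
brief = hand_off(packet)
```

The gate is the point of the sketch: if the agent cannot fill in the packet, the transfer fails in testing rather than in front of a user who has to repeat themselves.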