
69 posts tagged with "ai-agents"


How AI Agents Actually Work: Architecture, Planning, and Failure Modes

10 min read
Tian Pan
Software Engineer

Most agent failures are architecture failures. The model gets blamed when a task goes sideways, but nine times out of ten, the real problem is that nobody thought hard enough about how planning, tool use, and reflection should fit together. You can swap in a better model and still hit the same failures — because the scaffolding around the model was never designed to handle what the model was being asked to do.

This post is a practical guide to how agents actually work under the hood: what the core components are, where plans go wrong, how reflection loops help (and when they hurt), and what multi-agent systems look like when you're building them for production rather than demos.

LLM-Powered Autonomous Agents: The Architecture Behind Real Autonomy

8 min read
Tian Pan
Software Engineer

Most teams that claim to have "agents in production" don't. Surveys consistently show that around 57% of engineering organizations have deployed AI agents — but when you apply rigorous criteria (the LLM must plan, act, observe feedback, and adapt based on results), only 16% of enterprise deployments and 27% of startup deployments qualify as true agents. The rest are glorified chatbots with tool calls bolted on.

This gap isn't about model capability. It's about architecture. Genuine autonomous agents require three interlocking subsystems working in concert: planning, memory, and tool use. Most implementations get one right, partially implement a second, and ignore the third. The result is a system that works beautifully in demos and fails unpredictably in production.
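To make the plan–act–observe–adapt criterion concrete, here is a minimal sketch of that loop in Python. `call_llm` and the `tools` registry are hypothetical stand-ins for whatever model API and tool set you actually use; the point is the shape of the loop, not the names.

```python
def call_llm(messages: list[dict]) -> dict:
    """Placeholder for a chat-model call that returns either a tool
    invocation {"tool": name, "args": {...}} or a final {"answer": text}."""
    raise NotImplementedError

def run_agent(task: str, tools: dict, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]      # working memory
    for _ in range(max_steps):
        decision = call_llm(messages)                   # plan the next action
        if "answer" in decision:                        # model decides it's done
            return decision["answer"]
        result = tools[decision["tool"]](**decision["args"])        # act
        messages.append({"role": "tool", "content": str(result)})  # observe
        # adapt: the appended observation changes what the next planning call sees
    return "stopped: step budget exhausted"
```

If your system never feeds tool results back into the next planning call, it fails the "observe and adapt" half of the test — it's a chatbot with tool calls, not an agent.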

The Agent Evaluation Readiness Checklist

9 min read
Tian Pan
Software Engineer

Most teams building AI agents make the same mistake: they start with the evaluation infrastructure before they understand what failure looks like. They instrument dashboards, choose metrics, wire up graders — and then discover their evals are measuring the wrong things entirely. Six weeks in, they have a green scorecard and a broken agent.

The fix is not more tooling. It is a specific sequence of steps that grounds your evaluation in reality before you automate anything. Here is that sequence.
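As a taste of what grounding evaluation in reality means, here is a deliberately small sketch: hand-label a few real transcripts first, then measure whether an automated grader agrees with those labels before you trust its scorecard. The grader, labels, and data here are invented for illustration.

```python
def agreement(grader, labeled: list[tuple[str, bool]]) -> float:
    """Share of hand labels an automated grader reproduces."""
    return sum(grader(t) == ok for t, ok in labeled) / len(labeled)

# Step 1: pull real production transcripts and label pass/fail by hand.
labeled = [
    ("transcript where the agent skipped the refund tool", False),
    ("transcript where the agent completed the task", True),
]

# Step 2: only automate once the grader agrees with your hand labels.
naive_grader = lambda t: "completed" in t
print(agreement(naive_grader, labeled))  # 1.0 on this toy data; measure on real data
```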

Self-Healing Agents in Production: How to Build Systems That Fix Themselves

7 min read
Tian Pan
Software Engineer

Most agent failures don't announce themselves. There's no crash, no alert, no stack trace. Your agent just quietly returns wrong answers, skips tool calls, or stalls mid-task — and you find out three hours later when a user complains. The gap between "works in dev" and "reliable in production" isn't about adding more retries. It's about building a system that can detect its own failures, classify them, and recover without waking you up at 2am.

Here's what a self-healing agent pipeline actually looks like in practice.
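As a rough sketch of the detect–classify–recover shape — with an invented failure taxonomy and a hypothetical `with_feedback` re-prompt hook — note that the key move is explicit output validation, because silent failures never raise on their own:

```python
import time
from enum import Enum, auto

class Failure(Enum):
    TRANSIENT = auto()    # timeouts, rate limits: retry with backoff
    BAD_OUTPUT = auto()   # failed validation: re-prompt with the error
    STUCK = auto()        # anything else: escalate to a human

def classify(exc: Exception) -> Failure:
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return Failure.TRANSIENT
    if isinstance(exc, ValueError):
        return Failure.BAD_OUTPUT
    return Failure.STUCK

def run_step_with_healing(step, validate, retries: int = 3):
    for attempt in range(retries):
        try:
            result = step()
            if not validate(result):          # detect: silent failures don't
                raise ValueError("output failed validation")  # announce themselves
            return result
        except Exception as exc:
            kind = classify(exc)
            if kind is Failure.TRANSIENT:
                time.sleep(2 ** attempt)              # back off, then retry
            elif kind is Failure.BAD_OUTPUT:
                step = step.with_feedback(str(exc))   # hypothetical re-prompt hook
            else:
                raise                                 # STUCK: wake a human on purpose
    raise RuntimeError("retries exhausted; escalating")
```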

The Anatomy of an Agent Harness

9 min read
Tian Pan
Software Engineer

Most engineers building AI agents spend 80% of their time thinking about which model to use and 20% thinking about everything else. That ratio should be flipped. The model is almost interchangeable at this point — the harness is what determines whether your agent actually works in production.

The equation is simple: Agent = Model + Harness. If you're not building the model, you're building the harness. And the harness is where nearly all the real engineering lives.
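A minimal sketch of what "the harness" means in practice — step budgets, tool timeouts, and a transcript for postmortems — assuming a hypothetical `model_call` that returns either a tool request or a final answer:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as ToolTimeout

class Harness:
    def __init__(self, model_call, tools, max_steps=8, tool_timeout_s=30):
        self.model_call = model_call   # the Model half of the equation
        self.tools = tools             # everything below this line is Harness
        self.max_steps = max_steps
        self.tool_timeout_s = tool_timeout_s
        self.transcript = []           # full trace for debugging and evals

    def _run_tool(self, name, args):
        pool = ThreadPoolExecutor(max_workers=1)
        try:   # bound tool runtime so one hung call can't stall the loop
            return pool.submit(self.tools[name], **args).result(
                timeout=self.tool_timeout_s)
        finally:
            pool.shutdown(wait=False)

    def run(self, task):
        state = [{"role": "user", "content": task}]
        for _ in range(self.max_steps):            # step budget
            action = self.model_call(state)
            self.transcript.append(action)
            if "answer" in action:
                return action["answer"]
            try:
                out = self._run_tool(action["tool"], action["args"])
            except ToolTimeout:
                out = f"tool {action['tool']} timed out"  # fed back, not fatal
            state.append({"role": "tool", "content": str(out)})
        return "stopped at step budget"
```

Swap in a different model and none of this code changes — which is exactly the point.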

How AI Agents Actually Learn Over Time

8 min read
Tian Pan
Software Engineer

Most teams building AI agents treat the model as a fixed artifact. You pick a foundation model, write your prompts, wire up some tools, and ship. If the agent starts making mistakes, you tweak the system prompt or switch to a newer model. Learning, in this framing, happens upstream—at the AI lab, during pretraining and RLHF—not in your stack.

This is the wrong mental model. Agents that improve over time do so at three distinct architectural layers, and only one of them involves touching model weights. Teams that understand this distinction build systems that compound in quality; teams that miss it keep manually patching the same failure modes.

Context Engineering for Personalization: How to Build Long-Term Memory Into AI Agents

8 min read
Tian Pan
Software Engineer

Most agent demos are stateless. A user asks a question, the agent answers, the session ends — and the next conversation starts from scratch. That's fine for a calculator. It's not fine for an assistant that's supposed to know you.

The gap between a useful agent and a frustrating one often comes down to one thing: whether the system remembers what matters. This post breaks down how to architect durable, personalized memory into production AI agents — covering the four-phase lifecycle, layered precedence rules, and the specific failure modes that will bite you if you skip the engineering.
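Layered precedence is easier to see in code than in prose. A minimal sketch, assuming three layers — session state over user profile over global defaults; the layer names and fields are illustrative, not a prescribed schema:

```python
from collections import ChainMap

global_defaults = {"tone": "neutral", "units": "metric"}
user_profile    = {"tone": "concise", "name": "Ada"}   # durable, per-user memory
session_state   = {"units": "imperial"}                # scoped to this conversation

# Nearer layers win: session overrides profile, profile overrides defaults.
context = ChainMap(session_state, user_profile, global_defaults)

assert context["units"] == "imperial"   # session override
assert context["tone"] == "concise"     # profile override
assert context["name"] == "Ada"         # falls through to the profile layer
```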

Routines and Handoffs: The Two Primitives Behind Every Reliable Multi-Agent System

8 min read
Tian Pan
Software Engineer

Most multi-agent systems fail not because the models are wrong, but because the plumbing is leaky. Agents drop context mid-task, hand off to the wrong specialist, or loop indefinitely when they don't know how to exit. The underlying cause is almost always the same: the system was designed around what each agent can do, without clearly defining how work moves between them.

Two primitives fix most of this: routines and handoffs. They're deceptively simple, but getting them right is the difference between a demo that works and a system you can ship.
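In code, the two primitives are smaller than they sound. In this sketch (names invented; the pattern loosely follows the routines-and-handoffs idea rather than any specific library's API), a routine is explicit, ordered instructions bound to an agent, and a handoff is just a tool call whose return value is the next agent:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    name: str
    routine: str                     # the routine: explicit, ordered instructions
    tools: list[Callable] = field(default_factory=list)

def transfer_to_refunds() -> "Agent":
    """The handoff: a tool whose return value is the next agent."""
    return refunds_agent

refunds_agent = Agent(
    name="refunds",
    routine="1. Verify the order. 2. Apply policy. 3. Issue or deny the refund.",
)

triage_agent = Agent(
    name="triage",
    routine="Identify intent, then hand off to the right specialist.",
    tools=[transfer_to_refunds],
)

# The harness executes the model's tool choice; an Agent return value
# switches control while the conversation history travels along intact.
current = triage_agent
result = current.tools[0]()          # stands in for the model picking this tool
if isinstance(result, Agent):
    current = result                 # handoff complete: new routine, same context
assert current.name == "refunds"
```

Because the handoff is an ordinary tool call, there is always a defined exit: the agent either answers, calls a regular tool, or transfers — which is what prevents the indefinite loops described above.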

Measuring AI Agent Autonomy in Production: What the Data Actually Shows

7 min read
Tian Pan
Software Engineer

Most teams building AI agents spend weeks on pre-deployment evals and almost nothing on measuring what their agents actually do in production. That's backwards. The metrics that matter—how long agents run unsupervised, how often they ask for help, how much risk they take on—only emerge at runtime, across thousands of real sessions. Without measuring these, you're flying blind.
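If you're not logging these at all, here is a sketch of where to start: session-level counters rolled up into the three metrics above. The field names and the definition of a "risky" action are assumptions you'd tailor to your domain.

```python
from dataclasses import dataclass

@dataclass
class SessionLog:
    actions: int                  # total agent actions in the session
    help_requests: int            # times the agent deferred to a human
    risky_actions: int            # e.g. writes, deletes, external side effects
    seconds_unsupervised: float   # wall-clock time with no human input

def autonomy_metrics(sessions: list[SessionLog]) -> dict:
    total_actions = max(1, sum(s.actions for s in sessions))
    return {
        "avg_unsupervised_minutes":
            sum(s.seconds_unsupervised for s in sessions) / len(sessions) / 60,
        "help_request_rate": sum(s.help_requests for s in sessions) / total_actions,
        "risky_action_share": sum(s.risky_actions for s in sessions) / total_actions,
    }
```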

A large-scale study of production agent behavior across thousands of deployments and software engineering sessions has surfaced some genuinely counterintuitive findings. The picture that emerges is not the one most builders expect.