Skip to main content

51 posts tagged with "ai-agents"

View all tags

12-Factor Agents: A Framework for Building AI Systems That Actually Ship

· 11 min read
Tian Pan
Software Engineer

The teams actually shipping reliable AI agents to production customers are mostly not using agent frameworks. They rolled their own.

That observation, surfaced from conversations with 100+ technical founders, is the uncomfortable starting point for the 12-Factor Agents framework — a manifesto for building LLM-powered software that reaches production instead of languishing at 80% quality forever. The framework is named deliberately after the original 12-Factor App methodology that shaped a generation of web services. The analogy holds: just as the 12-factor app gave teams a principled approach to building deployable web services, 12-factor agents provides the principles for building reliable, observable AI systems.

The 19,000-star GitHub repository documents what the best-performing production teams figured out independently. Here is what they know.

The Lethal Trifecta: Why Your AI Agent Is One Email Away from a Data Breach

· 9 min read
Tian Pan
Software Engineer

In June 2025, a researcher sent a carefully crafted email to a Microsoft 365 Copilot user. No link was clicked. No attachment opened. The email arrived, Copilot read it during a routine summarization task, and within seconds the AI began exfiltrating files from OneDrive, SharePoint, and Teams — silently transmitting contents to an attacker-controlled server by encoding data into image URLs it asked to "render." The victim never knew it happened.

This wasn't a novel zero-day in the traditional sense. There was no buffer overflow, no SQL injection. The vulnerability was architectural: the system combined three capabilities that, individually, seem like obvious product features. Together, they form what's now called the Lethal Trifecta.

Why Long-Running AI Agents Break in Production (And the Infrastructure to Fix It)

· 9 min read
Tian Pan
Software Engineer

Most AI agent demos work beautifully.

They run in under 30 seconds, hit three tools, and return a clean result. Then someone asks the agent to do something that actually matters — cross-reference a codebase, run a multi-stage data pipeline, process a batch of documents — and the whole thing falls apart in a cascade of timeouts, partial state, and duplicate side effects.

The problem is not the model. It is the infrastructure. Agents that run for minutes or hours face a completely different class of systems problems than agents that finish in seconds, and most teams hit this wall at the worst possible time: after they have already shipped something users depend on.

Context Engineering: Why What You Feed the LLM Matters More Than How You Ask

· 11 min read
Tian Pan
Software Engineer

Most LLM quality problems aren't prompt problems. They're context problems.

You spend hours crafting the perfect system prompt. You add XML tags, chain-of-thought instructions, and careful persona definitions. You test it on a handful of inputs and it looks great. Then you ship it, and two weeks later you're staring at a ticket where the agent confidently told a user the wrong account balance — because it retrieved the previous user's transaction history. The model understood the instructions perfectly. It just had the wrong inputs.

This is the core distinction between prompt engineering and context engineering. Prompt engineering asks: "How should I phrase this?" Context engineering asks: "What does the model need to know right now, and how do I make sure it gets exactly that?" One is copywriting. The other is systems architecture.

Memory Architectures for Production AI Agents

· 10 min read
Tian Pan
Software Engineer

Most teams add memory to their agents as an afterthought — usually after a user complains that the agent forgot something it was explicitly told three sessions ago. At that point, the fix feels obvious: store conversations somewhere and retrieve them later. But this intuition leads to systems that work in demos and fall apart in production. The gap between a memory system that stores things and one that reliably surfaces the right things at the right time is where most agent projects quietly fail.

Memory architecture is not a peripheral concern. For any agent handling multi-session interactions — customer support, coding assistants, research tools, voice interfaces — memory is the difference between a stateful assistant and a very expensive autocomplete. Getting it wrong doesn't produce crashes; it produces agents that feel subtly broken, that contradict themselves, or that confidently repeat outdated information the user corrected two weeks ago.

Cloud Agents Are Rewriting How Software Gets Built

· 7 min read
Tian Pan
Software Engineer

The first time an AI coding agent broke a team's CI pipeline—not by writing bad code, but by generating pull requests faster than GitHub Actions could process them—it became clear something fundamental had shifted. We were no longer talking about a smarter autocomplete. We were talking about a different model of software production entirely.

The arc of AI-assisted coding has moved quickly. Autocomplete tools changed how individuals typed. Local agents changed what a single session could accomplish. Cloud agents are now changing how teams build software—parallelizing work across multiple asynchronous threads, running tests before handing off PRs, and increasingly handling 3-hour tasks while developers sleep or move on to other problems.

How AI Agents Actually Work: Architecture, Planning, and Failure Modes

· 10 min read
Tian Pan
Software Engineer

Most agent failures are architecture failures. The model gets blamed when a task goes sideways, but nine times out of ten, the real problem is that nobody thought hard enough about how planning, tool use, and reflection should fit together. You can swap in a better model and still get the same crashes — because the scaffolding around the model was never designed to handle what the model was being asked to do.

This post is a practical guide to how agents actually work under the hood: what the core components are, where plans go wrong, how reflection loops help (and when they hurt), and what multi-agent systems look like when you're building them for production rather than demos.

LLM-Powered Autonomous Agents: The Architecture Behind Real Autonomy

· 8 min read
Tian Pan
Software Engineer

Most teams that claim to have "agents in production" don't. Surveys consistently show that around 57% of engineering organizations have deployed AI agents — but when you apply rigorous criteria (the LLM must plan, act, observe feedback, and adapt based on results), only 16% of enterprise deployments and 27% of startup deployments qualify as true agents. The rest are glorified chatbots with tool calls bolted on.

This gap isn't about model capability. It's about architecture. Genuine autonomous agents require three interlocking subsystems working in concert: planning, memory, and tool use. Most implementations get one right, partially implement a second, and ignore the third. The result is a system that works beautifully in demos and fails unpredictably in production.

The Agent Evaluation Readiness Checklist

· 9 min read
Tian Pan
Software Engineer

Most teams building AI agents make the same mistake: they start with the evaluation infrastructure before they understand what failure looks like. They instrument dashboards, choose metrics, wire up graders — and then discover their evals are measuring the wrong things entirely. Six weeks in, they have a green scorecard and a broken agent.

The fix is not more tooling. It is a specific sequence of steps that grounds your evaluation in reality before you automate anything. Here is that sequence.

Self-Healing Agents in Production: How to Build Systems That Fix Themselves

· 7 min read
Tian Pan
Software Engineer

Most agent failures don't announce themselves. There's no crash, no alert, no stack trace. Your agent just quietly returns wrong answers, skips tool calls, or stalls mid-task — and you find out three hours later when a user complains. The gap between "works in dev" and "reliable in production" isn't about adding more retries. It's about building a system that can detect its own failures, classify them, and recover without waking you up at 2am.

Here's what a self-healing agent pipeline actually looks like in practice.

The Anatomy of an Agent Harness

· 9 min read
Tian Pan
Software Engineer

Most engineers building AI agents spend 80% of their time thinking about which model to use and 20% thinking about everything else. That ratio should be flipped. The model is almost interchangeable at this point — the harness is what determines whether your agent actually works in production.

The equation is simple: Agent = Model + Harness. If you're not the model, you're the harness. And the harness is where nearly all the real engineering lives.

How AI Agents Actually Learn Over Time

· 8 min read
Tian Pan
Software Engineer

Most teams building AI agents treat the model as a fixed artifact. You pick a foundation model, write your prompts, wire up some tools, and ship. If the agent starts making mistakes, you tweak the system prompt or switch to a newer model. Learning, in this framing, happens upstream—at the AI lab, during pretraining and RLHF—not in your stack.

This is the wrong mental model. Agents that improve over time do so at three distinct architectural layers, and only one of them involves touching model weights. Teams that understand this distinction build systems that compound in quality; teams that don't keep manually patching the same failure modes.