Skip to main content

49 posts tagged with "system-design"

View all tags

The Approval Queue That Became Your Critical Path

· 11 min read
Tian Pan
Software Engineer

The design doc said "human in the loop." The launch deck said "safe by default." The incident review six months later said the agent took ninety minutes to send a customer an invoice because the approver was at lunch. None of those documents were lying. They were describing the same component at different points on its load curve — and only one of them got the shape right.

When you put a human between an agent and an irreversible action, you have not added a safety primitive. You have added a service with a queue, a throughput limit, a quality-versus-load curve, and an availability profile. The team that ships the agent without naming that service has shipped a product whose critical path runs through a piece of infrastructure they refuse to operate.

The Nightly Batch Job That Quietly Became a Latency-Critical Service

· 10 min read
Tian Pan
Software Engineer

It started as a cron job. Every night at 2 a.m., a script woke up, pulled the day's records, ran them through a model, wrote the results to a table, and went back to sleep. It was the simplest possible shape for the problem, and for a year it was exactly the right shape. Nobody thought about it because nobody needed to.

Then someone asked if the results could be ready by 8 a.m. instead of noon. Then someone asked if a user could trigger a run for a single record on demand. Then a product manager asked if it could "feel instant" inside the app. Each request was reasonable. Each change was small. And at no point did anyone open a document titled "Re-architecting the inference pipeline," because at no point did any single change feel like a rewrite.

Eighteen months later you have a latency-critical online service wearing the body of a batch job. It has a p99 nobody measures, a queue nobody drains, and a failure mode where one bad record stalls a user-facing request because the pipeline was built to retry the whole batch. This is one of the most common architectural failures in AI systems, and it almost never shows up as a decision. It shows up as a slow accumulation of reasonable yeses.

Token Budgets Are a Scheduling Problem, Not a Prompt Problem

· 9 min read
Tian Pan
Software Engineer

When an agent gives a worse answer than it did last week, the first instinct is to blame the prompt. Someone reworks the system instructions, trims a few sentences, adds an example, and ships. Sometimes it helps. Often it does nothing, because the prompt was never the problem. The problem is that a single verbose tool result quietly consumed 18,000 tokens, pushed the actual task instructions into the low-attention middle of the context window, and left the model reasoning over a transcript that is 70% noise.

That is not a wording problem. That is a resource-allocation problem. And resource allocation has a name in systems engineering: scheduling. The context window is a fixed-size resource, multiple consumers compete for it, and right now most agent stacks "schedule" it the way a 1960s batch system scheduled memory — first come, first served, until it runs out.

The Undo Button Your Agent Assumes Exists

· 9 min read
Tian Pan
Software Engineer

Watch an agent reason through a multi-step task and you will notice something familiar: it plans the way you debug. Try an approach, look at the result, and if it is wrong, back out and try another. The agent talks about its plan as a tree of options it can explore, prune, and revisit. That mental model is correct inside a code sandbox, where every action has an implicit undo. It is dangerously wrong the moment the agent touches the world.

A sent email does not unsend. A charged card does not uncharge without a refund flow, a fee, and a customer who already saw the notification. A deleted row is gone unless someone wired up soft deletes. A posted Slack message has already been read. The agent's planning model has no native concept of the one-way door — the action that, once taken, removes the option of pretending it never happened.

This is not a model intelligence problem. A smarter model still does not know which of your tools is reversible, because reversibility is not a property of the action. It is a property of the system the action lands in. You have to tell it.

When No One Answers the Escalation: Human-in-the-Loop Is a Staffing Problem

· 10 min read
Tian Pan
Software Engineer

Every agent architecture diagram has a box labeled "escalate to human." It is drawn with a clean arrow, it satisfies the reviewer, and it makes the system feel safe. What the diagram never shows is the person on the other end of that arrow — whether they exist, whether they are awake, and whether they will answer before the agent's patience runs out.

Human-in-the-loop is sold as a design pattern. In production it behaves like a staffing problem. The pattern assumes a human is standing by; the staffing reality is that escalations do not arrive when humans are available — they arrive on their own schedule. A burst at 2am when an overnight batch job trips a guardrail. A long tail through lunch when half the reviewers are away from their desks. A steady drip that quietly outgrows the two-person team that looked sufficient during the demo, when the agent handled ten requests a day instead of ten thousand.

The gap between "we have an escalation path" and "escalations get answered" is where agentic systems fail in ways no eval catches. The eval measures whether the agent escalates correctly. It never measures whether anyone was there.

Human Override as a First-Class Feature: Designing AI Systems That Fail Gracefully to Human Control

· 10 min read
Tian Pan
Software Engineer

When an AI-powered customer support agent can't resolve an issue and escalates to a human, what happens next? In most systems: the customer is transferred cold, with no context, and must re-explain everything from the beginning. The human agent has no idea what the AI attempted, what information was collected, or why the handoff occurred.

This is the most common form of human override failure — not a dramatic AI meltdown, but a quiet UX collapse at the seam between automated and human handling. It happens because engineers built the AI path carefully and treated human takeover as an afterthought, a fallback for when things go wrong. The result is that override feels like a system error rather than a designed operational mode.

The engineering teams that get this right treat human override as a first-class feature from day one. Here's what that looks like in practice.

Conflicting Instructions in System Prompts: The Silent Failure Mode No One Owns

· 10 min read
Tian Pan
Software Engineer

Your AI feature worked great at launch. Six months later it sometimes gives terse one-liners, sometimes writes five-paragraph essays, and occasionally refuses to answer questions it handled without complaint last quarter. Nothing in the codebase changed — or so you think. The system prompt changed, incrementally, through eleven pull requests authored by four engineers across two teams. Each change was individually sensible. Collectively, they turned your prompt into a contradiction machine.

This is the instruction contradiction problem. It does not throw an exception. It does not appear in error logs. It manifests as behavioral drift — the model doing subtly different things in subtly different situations in ways that are hard to reproduce and harder to attribute. By the time a user files a bug, the prompt has already been patched twice more.

AI-Native API Design: Building Backends That Agents Can Actually Use

· 10 min read
Tian Pan
Software Engineer

Your REST API works fine. Documentation is thorough. Error codes are consistent. Every human-authored client you've ever tested handles it well. Then your team integrates an AI agent and within an hour it's generated 2,000 failed requests by retrying variations of an endpoint that doesn't exist — bulk_search_users, search_all_users, bulk_user_search — each attempt triggering real downstream processing.

This isn't a prompt engineering failure. It's an API design failure.

REST APIs were built for clients that parse documentation, respect contracts, and call exactly what's specified. AI agents are different: they reason about what an endpoint probably does based on names and descriptions, retry without tracking state, and treat error messages as instructions rather than diagnostic codes. Designing an API for an agentic caller requires rethinking assumptions that most backend engineers have never had to question.

The Human Bottleneck Problem: When Human-in-the-Loop Becomes Your Slowest Microservice

· 9 min read
Tian Pan
Software Engineer

Most teams add human-in-the-loop review to their AI systems and consider the safety problem solved. Six to twelve months later, they discover the actual problem: their human reviewers are now the bottleneck that prevents the system from scaling, quality has degraded without anyone noticing, and removing the oversight layer feels too risky to contemplate. They are stuck.

This is the HITL throughput failure. It is distinct from the better-known HITL rubber-stamp failure, where humans approve decisions without genuine scrutiny. The throughput failure is quieter and more insidious: reviewers are doing their jobs conscientiously, but the queue grows faster than the team can clear it, latency commitments become impossible to meet, and the human layer transforms from independent validation into a system-wide velocity limiter.

The Population Prompt Problem: Why Your System Prompt Works for 80% of Users and Silently Fails the Other 20%

· 10 min read
Tian Pan
Software Engineer

When you write a system prompt, you have a user in mind. Maybe it's the competent professional asking a focused question in clear English. Maybe it's someone who sends a short, well-scoped request that fits neatly inside your prompt's assumptions. You test against examples that feel representative, tune until the outputs look good, and ship.

Then you see production traffic.

The real population of queries your system prompt must handle is not the median case you designed for. It's a distribution — some narrow, many diffuse — with a long tail of edge cases that expose every assumption baked into your instructions. For most production systems, somewhere between 15% and 30% of real queries fall into categories the prompt handles poorly. The unsettling part: most of these failures are silent. Your system returns a 200, the user gets an answer that looks plausible, and the failure never surfaces in your logs.

AI Fallback Design Is an Architecture Problem, Not an Afterthought

· 9 min read
Tian Pan
Software Engineer

When McDonald's pulled the plug on its AI drive-thru after three years of operation, the failure wasn't that the model was bad at understanding orders. The failure was architectural: there was no clear escalation path to a human cashier, no confidence threshold that would trigger a retry, and no defined behavior for the system when it was confused. The AI just kept trying. Customers kept getting frustrated. The happy path was well-designed. Everything else wasn't.

That pattern repeats across almost every failed AI deployment. The model works in demos. It fails in production. And the post-mortem reveals the same root cause: fallback design was never part of the architecture. It was something someone planned to add later.

AI System Design Advisor: What It Gets Right, What It Gets Confidently Wrong, and How to Tell the Difference

· 9 min read
Tian Pan
Software Engineer

A three-person team spent a quarter implementing event sourcing for an application serving 200 daily active users. The architecture was technically elegant. It was operationally ruinous. The design came from an AI recommendation, and the team accepted it because the reasoning was fluent, the tradeoff analysis sounded rigorous, and the system they ended up with looked exactly like the kind of thing you'd see on a senior engineer's architecture diagram.

That story is now a cautionary pattern, not an edge case. AI produces genuinely useful architectural input in specific, identifiable situations — and produces confidently wrong advice in situations that look nearly identical from the outside. The gap between them is not obvious if you approach AI as an answer machine. It becomes navigable if you approach it as a sparring partner.