Skip to main content

44 posts tagged with "system-design"

View all tags

Human Override as a First-Class Feature: Designing AI Systems That Fail Gracefully to Human Control

· 10 min read
Tian Pan
Software Engineer

When an AI-powered customer support agent can't resolve an issue and escalates to a human, what happens next? In most systems: the customer is transferred cold, with no context, and must re-explain everything from the beginning. The human agent has no idea what the AI attempted, what information was collected, or why the handoff occurred.

This is the most common form of human override failure — not a dramatic AI meltdown, but a quiet UX collapse at the seam between automated and human handling. It happens because engineers built the AI path carefully and treated human takeover as an afterthought, a fallback for when things go wrong. The result is that override feels like a system error rather than a designed operational mode.

The engineering teams that get this right treat human override as a first-class feature from day one. Here's what that looks like in practice.

Conflicting Instructions in System Prompts: The Silent Failure Mode No One Owns

· 10 min read
Tian Pan
Software Engineer

Your AI feature worked great at launch. Six months later it sometimes gives terse one-liners, sometimes writes five-paragraph essays, and occasionally refuses to answer questions it handled without complaint last quarter. Nothing in the codebase changed — or so you think. The system prompt changed, incrementally, through eleven pull requests authored by four engineers across two teams. Each change was individually sensible. Collectively, they turned your prompt into a contradiction machine.

This is the instruction contradiction problem. It does not throw an exception. It does not appear in error logs. It manifests as behavioral drift — the model doing subtly different things in subtly different situations in ways that are hard to reproduce and harder to attribute. By the time a user files a bug, the prompt has already been patched twice more.

AI-Native API Design: Building Backends That Agents Can Actually Use

· 10 min read
Tian Pan
Software Engineer

Your REST API works fine. Documentation is thorough. Error codes are consistent. Every human-authored client you've ever tested handles it well. Then your team integrates an AI agent and within an hour it's generated 2,000 failed requests by retrying variations of an endpoint that doesn't exist — bulk_search_users, search_all_users, bulk_user_search — each attempt triggering real downstream processing.

This isn't a prompt engineering failure. It's an API design failure.

REST APIs were built for clients that parse documentation, respect contracts, and call exactly what's specified. AI agents are different: they reason about what an endpoint probably does based on names and descriptions, retry without tracking state, and treat error messages as instructions rather than diagnostic codes. Designing an API for an agentic caller requires rethinking assumptions that most backend engineers have never had to question.

The Human Bottleneck Problem: When Human-in-the-Loop Becomes Your Slowest Microservice

· 9 min read
Tian Pan
Software Engineer

Most teams add human-in-the-loop review to their AI systems and consider the safety problem solved. Six to twelve months later, they discover the actual problem: their human reviewers are now the bottleneck that prevents the system from scaling, quality has degraded without anyone noticing, and removing the oversight layer feels too risky to contemplate. They are stuck.

This is the HITL throughput failure. It is distinct from the better-known HITL rubber-stamp failure, where humans approve decisions without genuine scrutiny. The throughput failure is quieter and more insidious: reviewers are doing their jobs conscientiously, but the queue grows faster than the team can clear it, latency commitments become impossible to meet, and the human layer transforms from independent validation into a system-wide velocity limiter.

The Population Prompt Problem: Why Your System Prompt Works for 80% of Users and Silently Fails the Other 20%

· 10 min read
Tian Pan
Software Engineer

When you write a system prompt, you have a user in mind. Maybe it's the competent professional asking a focused question in clear English. Maybe it's someone who sends a short, well-scoped request that fits neatly inside your prompt's assumptions. You test against examples that feel representative, tune until the outputs look good, and ship.

Then you see production traffic.

The real population of queries your system prompt must handle is not the median case you designed for. It's a distribution — some narrow, many diffuse — with a long tail of edge cases that expose every assumption baked into your instructions. For most production systems, somewhere between 15% and 30% of real queries fall into categories the prompt handles poorly. The unsettling part: most of these failures are silent. Your system returns a 200, the user gets an answer that looks plausible, and the failure never surfaces in your logs.

AI Fallback Design Is an Architecture Problem, Not an Afterthought

· 9 min read
Tian Pan
Software Engineer

When McDonald's pulled the plug on its AI drive-thru after three years of operation, the failure wasn't that the model was bad at understanding orders. The failure was architectural: there was no clear escalation path to a human cashier, no confidence threshold that would trigger a retry, and no defined behavior for the system when it was confused. The AI just kept trying. Customers kept getting frustrated. The happy path was well-designed. Everything else wasn't.

That pattern repeats across almost every failed AI deployment. The model works in demos. It fails in production. And the post-mortem reveals the same root cause: fallback design was never part of the architecture. It was something someone planned to add later.

AI System Design Advisor: What It Gets Right, What It Gets Confidently Wrong, and How to Tell the Difference

· 9 min read
Tian Pan
Software Engineer

A three-person team spent a quarter implementing event sourcing for an application serving 200 daily active users. The architecture was technically elegant. It was operationally ruinous. The design came from an AI recommendation, and the team accepted it because the reasoning was fluent, the tradeoff analysis sounded rigorous, and the system they ended up with looked exactly like the kind of thing you'd see on a senior engineer's architecture diagram.

That story is now a cautionary pattern, not an edge case. AI produces genuinely useful architectural input in specific, identifiable situations — and produces confidently wrong advice in situations that look nearly identical from the outside. The gap between them is not obvious if you approach AI as an answer machine. It becomes navigable if you approach it as a sparring partner.

The Co-Pilot Trap: Why Full Autopilot Ships Faster but Fails Harder

· 9 min read
Tian Pan
Software Engineer

There's a pattern in how AI features die in production: they start as copilots and get promoted to autopilots. The promotion happens for obvious reasons—cost reduction, scale, reduced headcount—and the reasoning sounds solid at demo time. Then the edge cases accumulate. A user-facing recommendation becomes a user-facing decision. A suggestion becomes an action. And when the first systematic failure lands, the engineering team discovers that the error tolerance assumptions baked into the original design were never re-evaluated.

This is the co-pilot trap: building an AI feature for one tier of the automation spectrum, then promoting it to a higher tier without rebuilding the failure model that tier requires.

Dynamic System Prompt Assembly: Composable AI Behavior at Request Time

· 10 min read
Tian Pan
Software Engineer

Most teams start with a single, monolithic system prompt. It works fine in demos. Then the product grows: you add a power user tier, a compliance mode for enterprise customers, a new tool the model can call, and a feature-flag experiment your growth team wants to A/B test. You add all of that to the same prompt. Six months in, you have 4,000 words of instructions that nobody fully understands, behavior that changes unpredictably when you edit one section, and a debugging process that amounts to "change something and see what happens."

The answer most teams reach for is composable, dynamically assembled system prompts — building the prompt from modular components at request time rather than maintaining a static text file. It's a sound architectural instinct, but the implementation surface is larger than it looks. Composable prompts introduce a new class of failure modes that static prompts simply don't have.

Compound AI Systems: When Your Pipeline Is Smarter Than Any Single Model

· 9 min read
Tian Pan
Software Engineer

There is a persistent assumption in AI engineering that the path to better outputs is a better model. Bigger context window, fresher training data, higher benchmark scores. In practice, the teams shipping the most capable AI products are usually doing something different: they are assembling pipelines where multiple specialized components — a retriever, a reranker, a classifier, a code interpreter, and one or more language models — cooperate to handle a task that no single model could do reliably on its own.

This architectural pattern has a name — compound AI systems — and it is now the dominant paradigm for production AI. Understanding how to build these systems correctly, and where they fail when you don't, is one of the most important skills in applied AI engineering today.

Designing for Partial Completion: When Your Agent Gets 70% Done and Stops

· 10 min read
Tian Pan
Software Engineer

Every production agent system eventually ships a failure nobody anticipated: the agent that books the flight, fails to find a hotel, and leaves a user with half a confirmed itinerary and no clear way to finish. Not a crash. Not a refusal. Just a stopped agent with real-world side effects and no plan for what comes next.

The standard mental model for agent failure is binary — succeed or abort. Retry logic, exponential backoff, fallback prompts — all of these assume a clean boundary between "task running" and "task done." But real agents fail somewhere in the middle, and when they do, the absence of partial-completion design becomes the bug. You didn't need a smarter model. You needed a task state machine.

When Code Beats the Model: A Decision Framework for Replacing LLM Calls with Deterministic Logic

· 8 min read
Tian Pan
Software Engineer

Most AI engineering teams have the same story. They start with a hard problem that genuinely needs an LLM. Then, once the LLM infrastructure is in place, every new problem starts looking like a nail for the same hammer. Six months later, they're calling GPT-4o to check whether an email address contains an "@" symbol — and they're paying for it.

The "just use the model" reflex is now the dominant driver of unnecessary complexity, inflated costs, and fragile production systems in AI applications. It's not that engineers are careless. It's that LLMs are genuinely impressive, the tooling has lowered the barrier to using them, and once you've built an LLM pipeline, adding another call feels trivially cheap. It isn't.