678 posts tagged with "ai-engineering"

The Invisible Handoff: Why Production AI Failures Cluster at Component Boundaries

· 9 min read
Tian Pan
Software Engineer

When your AI feature ships a wrong answer, the first question is always: "Was it the model?" Most engineers reach for model evaluation, run a few test prompts, and conclude the model looks fine. They're usually right. The model is fine. The breakage happened somewhere else—at one of the invisible seams where your components talk to each other.

The evidence for this is consistent. Analysis of production RAG deployments shows 73% of failures are retrieval failures, not generation failures. In multi-agent systems, the most common failure modes are message ordering violations, state synchronization gaps, and schema mismatches—none of which show up in any per-component health check. GPT-4 produces invalid responses on complex extraction tasks nearly 12% of the time, not because the model is broken, but because the output format contract between the model and the downstream parser was never enforced.

The model gets blamed. The boundary is the culprit.
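One way to picture an enforced output contract is a validation step that sits between the model and the downstream parser. The sketch below is illustrative only: the `Invoice` type, its fields, and `parse_invoice` are hypothetical names, not anything taken from the post.

```python
import json
from dataclasses import dataclass


@dataclass
class Invoice:
    # Hypothetical downstream type; the fields are illustrative only.
    vendor: str
    total_cents: int


def parse_invoice(raw: str) -> Invoice:
    """Validate raw model output against the contract before downstream code sees it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")
    vendor = data.get("vendor")
    total = data.get("total_cents")
    if not isinstance(vendor, str) or not vendor:
        raise ValueError("'vendor' must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(total, bool) or not isinstance(total, int) or total < 0:
        raise ValueError("'total_cents' must be a non-negative integer")
    return Invoice(vendor=vendor, total_cents=total)
```

A rejected response then surfaces as a boundary error that can be counted, retried, or alerted on, rather than as a wrong answer attributed to the model.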

The Consistency Gap: Why Parallel LLM Calls Contradict Each Other and How to Fix It

· 10 min read
Tian Pan
Software Engineer

Imagine a multi-agent pipeline that processes a user's support ticket. Agent A reads the ticket history and decides the user is a power user who needs an advanced response. Agent B reads the same ticket history in a parallel call and decides the user is a beginner who needs step-by-step guidance. Both agents finish at the same time and hand their outputs to a composer agent—which now has to reconcile two fundamentally incompatible mental models of the same person.

This isn't a rare edge case. Research analyzing production multi-agent failures found that 36.9% of failures are caused by inter-agent misalignment: conflicting outputs, context loss during handoffs, and incompatible conclusions reached independently. The consistency gap—the tendency for parallel LLM calls to contradict each other about shared entities—is one of the most underappreciated failure modes in agentic systems.
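A minimal sketch of how such contradictions could be surfaced before the composer runs, assuming each agent's conclusions can be reduced to (agent, entity, attribute, value) tuples. That claim format and the `find_conflicts` helper are assumptions made for illustration, not part of the post.

```python
from collections import defaultdict


def find_conflicts(claims):
    """Return {(entity, attribute): {agent: value}} for attributes the agents disagree on."""
    seen = defaultdict(dict)  # (entity, attribute) -> {agent: value}
    for agent, entity, attribute, value in claims:
        seen[(entity, attribute)][agent] = value
    return {
        key: by_agent
        for key, by_agent in seen.items()
        if len(set(by_agent.values())) > 1
    }


# Hypothetical claims mirroring the support-ticket example above.
claims = [
    ("agent_a", "user_42", "skill_level", "power_user"),
    ("agent_b", "user_42", "skill_level", "beginner"),
    ("agent_a", "user_42", "plan", "pro"),
    ("agent_b", "user_42", "plan", "pro"),
]
conflicts = find_conflicts(claims)
```

Here `conflicts` contains only the `skill_level` disagreement, which gives the pipeline a concrete point at which to stop and reconcile instead of composing over two incompatible views.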

The Provider Behavioral Fingerprint: What Doesn't Survive a Model Switch

· 8 min read
Tian Pan
Software Engineer

When a cost spike, a model deprecation notice, or a competitor's benchmark forces you to swap providers, engineering teams typically evaluate the candidate on capability benchmarks and call it a migration plan. That process catches about half the problems. The other half aren't capability problems — they're behavioral ones: the invisible layer of formatting habits, refusal patterns, serialization quirks, and output conventions your production code has silently wired itself to over months of iteration.

The capability benchmark tells you whether the new model can do the task. The behavioral fingerprint tells you whether your codebase can survive the replacement.

The Rollout Sequencing Problem: Why Co-Deploying Model and Infrastructure Changes Destroys Observability

· 9 min read
Tian Pan
Software Engineer

Three weeks into your quarter, a production alert fires. Accuracy on a core task dropped eight percentage points. You open the dashboard and immediately notice three things that all landed in the same deploy window: a context length increase from 8k to 32k tokens, a model version upgrade from gpt-4-turbo-preview to gpt-4o, and a batch size change your infrastructure team pushed to improve throughput. None of the three changes individually was considered high-risk. Combined, they've created a debugging problem no one can solve cleanly.

Welcome to the rollout sequencing problem.

The Shadow Compute Tax: Why Your AI Inference Bill Is Bigger Than Your Users Deserve

· 9 min read
Tian Pan
Software Engineer

You're being charged for tokens that no user ever read. Not because of bugs, not because of vendor pricing tricks — but because your system is working exactly as designed, firing off background inference work that looked smart on a whiteboard but burns real budget on every request.

This is the shadow compute tax: the fraction of your inference spend that goes toward AI work that is speculative, premature, or structurally guaranteed never to reach a user. It's invisible in your dashboards until suddenly it isn't, and by then it's baked into your cost model as an assumption.
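The definition above can be made measurable with a rough calculation, assuming every inference call is logged with its billed token count and a flag for whether its output ever reached a user. Both fields are hypothetical log attributes introduced for this sketch.

```python
from dataclasses import dataclass


@dataclass
class InferenceCall:
    tokens: int          # prompt + completion tokens billed for this call
    shown_to_user: bool  # did any of this call's output reach a user?


def shadow_fraction(calls):
    """Fraction of billed tokens spent on calls whose output no user ever saw."""
    total = sum(c.tokens for c in calls)
    if total == 0:
        return 0.0
    shadow = sum(c.tokens for c in calls if not c.shown_to_user)
    return shadow / total
```

Token counts stand in for cost here; a version weighted by per-model pricing would be closer to the actual bill.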

The 200-Token System Prompt That Beats Your 4000-Token One

· 10 min read
Tian Pan
Software Engineer

A team I worked with spent six months tuning a system prompt to roughly 4,000 tokens. It was their crown jewel: a careful accretion of edge-case handling, formatting rules, persona instructions, fallback behaviors, and a dozen few-shot examples. Then a junior engineer joined, asked why the prompt was so long, and rewrote it in an afternoon. The new version was 200 tokens. On their existing eval suite it scored four points higher. At one-twentieth the length, it was also far cheaper to run on every request, and noticeably faster.

This is not an anecdote about a magic short prompt. It is a pattern I see almost every time I read a production system prompt that has lived past its first quarter. Long prompts grow by accretion, not by design. Every failure mode that surfaced in QA contributed a paragraph. Every stakeholder who watched a demo contributed a tone instruction. Every example that "seemed to help" got pinned to the bottom. The result is a prompt that is longer than the user input it is meant to instruct, full of internal contradictions the model has to silently resolve at inference time, with attention spread thinly across competing demands.

Agent Identifiability: When Your Trace Can't Tell You Which Agent Did What

· 11 min read
Tian Pan
Software Engineer

A user reports the assistant gave them a wrong answer at 9:47 a.m. You open the trace. There are three hundred and forty spans. They are almost all named agent.run, llm.invoke, or tool.call. Some have a parent. Some are siblings. Three of them retried. One of them retried and then was cancelled. None of them tells you whether the bad output came from the planner, the worker, the critic, the reflection pass, or the second retry of the worker after the critic flagged it.

You spend the next hour grepping log lines for a UUID prefix you saw in a screenshot, cross-referencing timestamps against a Slack notification, and reconstructing the agent topology in your head from the indentation pattern in the trace viewer. Eventually you guess that the third worker invocation ran with a model alias that silently flipped to a different snapshot the night before. You cannot prove it from the trace alone.

The agent worked. The trace is intact. The hairball is the bug.
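One possible shape for a more identifiable span is sketched below, assuming the tracing layer accepts a free-form span name and key-value attributes. The attribute keys, the naming scheme, and the snapshot string are all hypothetical, chosen only to show what the generic agent.run spans are missing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpanIdentity:
    role: str            # e.g. "planner", "worker", "critic"
    attempt: int         # 1 for the first run, 2 for the first retry, and so on
    model_snapshot: str  # the resolved snapshot, not the alias that was requested

    def span_name(self) -> str:
        # Encodes role and attempt so sibling spans are distinguishable at a glance.
        return f"agent.run/{self.role}#{self.attempt}"

    def attributes(self) -> dict:
        # Hypothetical attribute keys; adapt to whatever convention the tracer uses.
        return {
            "agent.role": self.role,
            "agent.attempt": self.attempt,
            "llm.model_snapshot": self.model_snapshot,
        }


# Illustrative values only; "example-model-2025-01-01" is a made-up snapshot name.
identity = AgentSpanIdentity(role="worker", attempt=3, model_snapshot="example-model-2025-01-01")
```

Recording the resolved snapshot rather than the alias is what would let the trace alone prove the overnight model flip described above.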

The Agentic Debugger's Trap: When Your Agent Patches Faster Than You Can Diagnose

· 10 min read
Tian Pan
Software Engineer

A staff engineer I worked with last quarter caught a bug that had already been "fixed" three times in the previous six weeks. Three different engineers. Three different files. Three green CI runs. Three accepted agent-generated patches. Each patch made the failing test pass and the user-reported error disappear. Each one moved the bug somewhere else, where it waited until a different surface area triggered it again. The fourth time it surfaced, the data corruption it caused had been silently compounding for forty days.

The bug was a single off-by-one in a pagination cursor. The agent had been right that the symptom would go away. It had been wrong about why. And the engineers — competent, senior, well-intentioned — had each accepted a passing patch before they understood the failure mechanism.

This is the agentic debugger's trap: your agent can produce a fix faster than you can build the mental model needed to evaluate whether the fix is correct. Patch velocity outruns diagnosis. The bug count drops, the CI dashboard goes green, and you ship a codebase whose failure modes you no longer understand.

The AI Bystander Effect: Why Five-Team Launches Ship Eval Suites Nobody Watches

· 10 min read
Tian Pan
Software Engineer

In 1964, Kitty Genovese was attacked outside her apartment building in Queens. Early newspaper accounts reported that thirty-eight people watched and that none of them called the police until it was too late, a count that later reporting showed to be exaggerated. Latané and Darley spent the next decade studying the behavior the story made famous: the more people who can see a problem, the less likely any single one of them is to act. They called it diffusion of responsibility. In their famous seizure experiment, 85% of participants intervened when they thought they were alone with the victim. When they believed four others could also hear the seizure, only 31% did.

Now picture your last AI feature launch. Product wrote the prompt. Engineering picked the model and wired the gateway. The data team curated the retrieval corpus. Safety bolted on the input and output filters. Customer support drafted the escalation playbook. Five teams in the room. Each one shipped its piece on time. Three months in, the feature's accuracy has quietly slid from 89% to 71%, the eval suite has not been run since launch week, and when you ask who owns the regression, every team can name three other teams that own it more.

Your AI Feature Needs a Kill Switch That Isn't a Deploy

· 13 min read
Tian Pan
Software Engineer

Picture the scene: it is 2:14 a.m., the on-call engineer's phone is buzzing, and the AI feature that powers your flagship product surface is confidently telling enterprise customers that their account number is "tomato soup." Maybe the model provider pushed a routing change, your prompt got truncated by a quietly upgraded tokenizer, or the retrieval index regenerated against a corrupted parquet file. The cause does not matter yet. What matters is the ten-minute clock until someone screenshots an output and posts it to LinkedIn.

If your only response is "revert the deploy and wait for CI," you have already lost. A standard pipeline rollback is twenty to forty minutes from page to recovery, and the bad outputs do not pause politely while the green checkmark renders. By the time the new container is healthy, the screenshot is in a thread, the support inbox has fifty tickets, and the trust you spent six months building is being audited by people who never use the product.

The teams that contain these incidents in five minutes instead of five hours did not get lucky. They built a kill switch before they needed one — a primitive that lets the on-call engineer disable the AI path in seconds without a deploy, without a merge, and without anyone touching the production binary. This post is about what that primitive looks like for AI features specifically, why the deterministic-software version of it is insufficient, and what has to be true the day before the incident for the response to work the night of.
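As a minimal sketch of that primitive, assume the switch is a small JSON file that the request path re-reads on every call. The file format and the `ai_enabled`, `answer`, `call_model`, and `fallback` names are hypothetical; a real system would more likely read from a flag service or config store.

```python
import json
from pathlib import Path


def ai_enabled(flag_path: Path) -> bool:
    """Read the flag on every request; treat a missing or malformed flag as 'off'."""
    try:
        return json.loads(flag_path.read_text()).get("ai_enabled") is True
    except (OSError, ValueError, AttributeError):
        # Missing file, invalid JSON, or a non-object payload all disable the AI path.
        return False


def answer(question, flag_path, call_model, fallback):
    """Route to the model only while the switch is on; otherwise use the non-AI path."""
    if not ai_enabled(flag_path):
        return fallback(question)
    return call_model(question)
```

Because the flag is evaluated per request, flipping it takes effect without a build, a merge, or a restart. Failing closed is a design choice in this sketch: an unreadable flag disables the AI path rather than leaving it on.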

AI Feature Soak Windows: Why a Two-Week Canary Misses What Actually Matters

· 13 min read
Tian Pan
Software Engineer

The two-week canary is one of those practices that sounds disciplined enough to skip the harder question. Engineering imported it from microservices — ramp 1% for a few days, watch error rate, ramp to 100%, declare done — and grafted it onto AI features without asking whether the failure modes that matter for AI even surface in two weeks. They don't. The bill that kills the feature lands in week six. The customer cohort that exposes the long-tail intent onboards in week five. The eval drift that scored +3% on launch day starts costing real money in week four because the new prompt's chattier outputs have been compounding token spend the whole time, and nobody was watching for that because the dashboard was watching for crashes.

A canary built around p95 latency and HTTP 500s will tell you the LLM is up. It will not tell you the feature is working. AI features fail in shapes the deploy ceremony was never designed to catch — slow shape changes in user behavior, gradual cache erosion, retrieval quality collapse, refusal-rate creep, cost trajectories that bend the wrong way — and almost all of them take longer than two weeks to declare themselves. The team that ships by the microservice clock is shipping by a clock the failures don't run on.

Bug Bashes for AI Features: Sampling a Distribution, Not Hunting Defects

· 11 min read
Tian Pan
Software Engineer

The classic bug bash is a deterministic ritual built for deterministic software. Ten engineers crowd a Slack channel for two hours, hammer a checklist of golden-path flows, and file tickets with crisp repro steps: "Click X, see Y, expected Z." It works because the system under test is reproducible — same input, same output, same bug, every time.

Run that exact ritual against an AI feature and you will produce two hundred tickets, close one hundred and eighty as "expected stochastic variation," and miss the twenty that signal a real cohort regression. The format isn't just stale; it's actively miscalibrated. A bug bash against an LLM-backed feature is not a defect-hunting session. It is a sampling exercise against a probability distribution, and the team that runs it like a deterministic test session is collecting noise and calling it signal.
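Treating the session as sampling suggests comparing each cohort's observed failure rate against a baseline, with some allowance for small samples. The sketch below uses the lower bound of a Wilson score interval for that allowance; the cohort names, counts, and 10% baseline are made up for illustration, and this is one reasonable test rather than the method the post prescribes.

```python
import math


def wilson_lower_bound(failures, samples, z=1.96):
    """Lower bound of the Wilson score interval for a failure rate (z=1.96 is about 95%)."""
    if samples == 0:
        return 0.0
    p = failures / samples
    denom = 1 + z * z / samples
    center = (p + z * z / (2 * samples)) / denom
    half = z * math.sqrt(p * (1 - p) / samples + z * z / (4 * samples * samples)) / denom
    return max(0.0, center - half)


def flag_cohorts(results, baseline_rate):
    """results maps cohort -> (failures, samples); flag cohorts credibly above baseline."""
    return sorted(
        cohort
        for cohort, (failures, samples) in results.items()
        if wilson_lower_bound(failures, samples) > baseline_rate
    )


# Hypothetical bug-bash tallies per user cohort.
results = {"enterprise": (18, 40), "free_tier": (6, 60), "new_users": (1, 5)}
flagged = flag_cohorts(results, baseline_rate=0.10)
```

With these numbers only the `enterprise` cohort is flagged. The `new_users` cohort shows a 20% raw failure rate, but five samples are too few to separate it from the baseline, which is the kind of ticket a deterministic triage rubric would misread.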

This post is about how to redesign the bug bash for stochastic systems — what to change about the format, the participants, the triage rubric, and what counts as "done."