720 posts tagged with "llm"

The Output Coupling Trap: Why Multi-Agent Systems Fail Silently at Interface Boundaries

May 4, 2026 · 9 min read

Tian Pan

Software Engineer

Your multi-agent pipeline finished. No exceptions were raised. The orchestrator reported success. And yet, the answer is wrong in a way that makes no sense — the executor skipped two steps, the summarizer collapsed three sections into one non-sequitur, and the output looks like it came from a different task entirely. There's no stack trace to follow. No error code to search. Just a quietly incorrect result.

This is the output coupling trap. It's not a model quality problem. It's an interface engineering problem, and it's the leading cause of silent production failures in multi-agent systems.

The Consistency Gap: Why Parallel LLM Calls Contradict Each Other and How to Fix It

May 4, 2026 · 10 min read

Tian Pan

Software Engineer

Imagine a multi-agent pipeline that processes a user's support ticket. Agent A reads the ticket history and decides the user is a power user who needs an advanced response. Agent B reads the same ticket history in a parallel call and decides the user is a beginner who needs step-by-step guidance. Both agents finish at the same time and hand their outputs to a composer agent—which now has to reconcile two fundamentally incompatible mental models of the same person.

This isn't a rare edge case. Research analyzing production multi-agent failures found that 36.9% of failures are caused by inter-agent misalignment: conflicting outputs, context loss during handoffs, and incompatible conclusions reached independently. The consistency gap—the tendency for parallel LLM calls to contradict each other about shared entities—is one of the most underappreciated failure modes in agentic systems.

The Words You Choose in Your System Prompt Change What Your Agent Will Risk

May 4, 2026 · 8 min read

Tian Pan

Software Engineer

Here is something that shouldn't be surprising but is: when you tell an agent "avoid making mistakes" versus "prioritize accuracy," you are not giving it the same instruction. The observable behavior at ambiguous decision points diverges measurably — agents prompted with loss-avoidance framing hedge more, escalate more, and complete fewer tasks end-to-end. Agents prompted with gain-seeking framing complete more tasks but introduce more errors. The difference isn't philosophical; it shows up in eval logs.

This is the behavioral economics of agents, and most engineering teams haven't thought about it systematically. They write system prompts as documentation — a description of what the agent is — when system prompts are actually decision-shaping instruments that encode a risk posture whether the author intended that or not.

The Provider Behavioral Fingerprint: What Doesn't Survive a Model Switch

May 4, 2026 · 8 min read

Tian Pan

Software Engineer

When a cost spike, a model deprecation notice, or a competitor's benchmark forces a provider swap, engineering teams typically evaluate the candidate on capability benchmarks and call it a migration plan. That process catches about half the problems. The other half are behavioral rather than capability problems: the invisible layer of formatting habits, refusal patterns, serialization quirks, and output conventions your production code has silently wired itself to over months of iteration.

The capability benchmark tells you whether the new model can do the task. The behavioral fingerprint tells you whether your codebase can survive the replacement.

The Rollout Sequencing Problem: Why Co-Deploying Model and Infrastructure Changes Destroys Observability

May 4, 2026 · 9 min read

Tian Pan

Software Engineer

Three weeks into your quarter, a production alert fires. Accuracy on a core task dropped eight percentage points. You open the dashboard and immediately notice three things that all landed in the same deploy window: a context length increase from 8k to 32k tokens, a model version upgrade from gpt-4-turbo-preview to gpt-4o, and a batch size change your infrastructure team pushed to improve throughput. None of the three changes individually was considered high-risk. Combined, they've created a debugging problem no one can solve cleanly.

Welcome to the rollout sequencing problem.

The Shadow Compute Tax: Why Your AI Inference Bill Is Bigger Than Your Users Deserve

May 4, 2026 · 9 min read

Tian Pan

Software Engineer

You're being charged for tokens that no user ever read. Not because of bugs, not because of vendor pricing tricks — but because your system is working exactly as designed, firing off background inference work that looked smart on a whiteboard but burns real budget on every request.

This is the shadow compute tax: the fraction of your inference spend that goes toward AI work that is speculative, premature, or structurally guaranteed never to reach a user. It's invisible in your dashboards until suddenly it isn't, and by then it's baked into your cost model as an assumption.
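As a rough illustration of that fraction, the sketch below computes the share of billed tokens spent on calls whose output never reached a user. The `InferenceCall` record and its `reached_user` field are hypothetical; they assume your logging can attribute each call to a user-visible outcome, which many pipelines cannot do yet.

```python
from dataclasses import dataclass


@dataclass
class InferenceCall:
    """One model call; both fields are hypothetical log attributes."""
    tokens: int          # prompt + completion tokens billed for the call
    reached_user: bool   # True if any of the output was shown to a user


def shadow_fraction(calls: list[InferenceCall]) -> float:
    """Share of billed tokens spent on calls whose output no user saw."""
    total = sum(call.tokens for call in calls)
    if total == 0:
        return 0.0
    shadow = sum(call.tokens for call in calls if not call.reached_user)
    return shadow / total


calls = [
    InferenceCall(tokens=600, reached_user=True),
    InferenceCall(tokens=300, reached_user=False),  # speculative prefetch, discarded
    InferenceCall(tokens=100, reached_user=False),  # background draft, never opened
]
print(f"{shadow_fraction(calls):.0%}")  # prints "40%"
```

Weighting by tokens rather than by call count is a deliberate choice here, since the bill scales with tokens and not with the number of requests.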

The Summarization Validity Problem: How to Know Your AI Compressed Away What Mattered

May 4, 2026 · 10 min read

Tian Pan

Software Engineer

Summarization fails silently. Your system doesn't crash, logs don't flag an error, and the generated text looks coherent—but somewhere in the compression, the one fact that mattered for the downstream task got dropped. The RAG pipeline returns a confident answer. The multi-hop reasoner reaches a conclusion. The customer service agent gives advice. All of it grounded in a summary that no longer contains the original constraint, exception, or data point the answer depended on.

This is the summarization validity problem: the gap between a summary that is consistent with its source and a summary that preserves what the downstream task needs. Most teams don't instrument for it. They ship pipelines that validate that summaries exist, not that they are complete.
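One crude way to instrument for this gap is to list the facts the downstream task depends on and check that each one survives compression. The sketch below is only an illustration: `missing_facts` is a hypothetical helper, and case-insensitive substring matching is an assumed stand-in for whatever stronger check (entailment, question answering over the summary) a real pipeline would use.

```python
def missing_facts(summary: str, required_facts: list[str]) -> list[str]:
    """Return the required facts that do not appear verbatim in the summary."""
    lowered = summary.lower()
    return [fact for fact in required_facts if fact.lower() not in lowered]


# Hypothetical example: the source policy allowed returns within 30 days,
# but only for unopened items. The summary kept one constraint and lost the other.
source_constraints = ["30 days", "unopened"]
summary = "Customers can return items within 30 days for a full refund."
print(missing_facts(summary, source_constraints))  # prints ['unopened']
```

The summary above is consistent with its source and still drops the exception the answer depends on, which is the failure a bare existence check cannot see.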

The Zero-Shot Wall: Why In-Context Examples Stop Working at Production Scale

May 4, 2026 · 8 min read

Tian Pan

Software Engineer

Most teams discover the zero-shot wall the same way: a new edge case breaks the model, they add an example to the prompt, it helps. Three months later they've got 40 examples, 6,000 tokens of context, the performance metrics haven't moved in weeks, and the prompt engineer who knows where every example came from just left the company.

Few-shot prompting is seductive because it works quickly. You observe a failure, you add a demonstration, the failure goes away. The feedback loop is tight and the wins feel free. What you don't notice is that each subsequent example is buying less than the last — and at some point you're spending tokens, latency, and cognitive overhead for improvements that round to zero.

This is the zero-shot wall: not a hard limit where performance drops off a cliff, but a zone of sharply diminishing returns where in-context learning has hit the ceiling of what it can accomplish for your task, and the only lever left is fine-tuning.

Multi-Region AI Deployment: Data Residency, Model Parity, and the Latency Tax Nobody Budgets

May 3, 2026 · 10 min read

Tian Pan

Software Engineer

When engineers budget for multi-region AI deployments, they typically account for two variables: infrastructure cost per region and replication overhead. What they consistently underestimate — sometimes catastrophically — are three costs that only appear once you're live: model parity gaps that make your EU cluster produce different outputs than your US cluster, KV cache isolation penalties that make every token in GDPR territory more expensive to generate, and silent compliance violations that trigger when your retry logic routes a French user's data through Virginia.

A German bank spent 14 months deploying a large open-source model on-premises to satisfy GDPR requirements. That's not unusual. What's unusual is that the engineers who proposed the architecture understood the compliance constraint upfront. Most don't until an incident report forces the conversation.

The 200-Token System Prompt That Beats Your 4000-Token One

May 2, 2026 · 10 min read

Tian Pan

Software Engineer

A team I worked with spent six months tuning a system prompt to roughly 4,000 tokens. It was their crown jewel: a careful accretion of edge-case handling, formatting rules, persona instructions, fallback behaviors, and a dozen few-shot examples. Then a junior engineer joined, asked why the prompt was so long, and rewrote it in an afternoon. The new version was 200 tokens. On their existing eval suite it scored four points higher. At one-twentieth the system-prompt length, it was also far cheaper to run, and noticeably faster.

This is not an anecdote about a magic short prompt. It is a pattern I see almost every time I read a production system prompt that has lived past its first quarter. Long prompts grow by accretion, not by design. Every failure mode that surfaced in QA contributed a paragraph. Every stakeholder who watched a demo contributed a tone instruction. Every example that "seemed to help" got pinned to the bottom. The result is a prompt that is longer than the user input it is meant to instruct, full of internal contradictions the model has to silently resolve at inference time, with attention spread thinly across competing demands.

Your AI Feature Needs a Kill Switch That Isn't a Deploy

May 2, 2026 · 13 min read

Tian Pan

Software Engineer

Picture the scene: it is 2:14 a.m., the on-call engineer's phone is buzzing, and the AI feature that powers your flagship product surface is confidently telling enterprise customers that their account number is "tomato soup." The model provider pushed a routing change, your prompt got truncated by a quietly upgraded tokenizer, or the retrieval index regenerated against a corrupted parquet file. The cause does not matter yet. What matters is the ten-minute clock until someone screenshots an output and posts it to LinkedIn.

If your only response is "revert the deploy and wait for CI," you have already lost. A standard pipeline rollback is twenty to forty minutes from page to recovery, and the bad outputs do not pause politely while the green checkmark renders. By the time the new container is healthy, the screenshot is in a thread, the support inbox has fifty tickets, and the trust you spent six months building is being audited by people who never use the product.

The teams that contain these incidents in five minutes instead of five hours did not get lucky. They built a kill switch before they needed one — a primitive that lets the on-call engineer disable the AI path in seconds without a deploy, without a merge, and without anyone touching the production binary. This post is about what that primitive looks like for AI features specifically, why the deterministic-software version of it is insufficient, and what has to be true the day before the incident for the response to work the night of.
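A minimal sketch of such a primitive, under stated assumptions: the switch is a JSON file that the service re-reads on every request, so flipping it needs no deploy. The file location, the `ai_enabled` flag name, and the fallback message are all hypothetical, and a real system would more likely read from a feature-flag service or config store than from local disk.

```python
import json
from pathlib import Path
from typing import Callable

FALLBACK_MESSAGE = "This feature is temporarily unavailable."


def ai_enabled(flag_path: Path, flag_name: str = "ai_enabled") -> bool:
    """Read the flag on every request; treat any read problem as 'off'."""
    try:
        flags = json.loads(flag_path.read_text())
    except (OSError, ValueError):
        # Missing file, unreadable file, or malformed JSON: fail closed.
        return False
    return isinstance(flags, dict) and flags.get(flag_name) is True


def handle_request(text: str, flag_path: Path, call_model: Callable[[str], str]) -> str:
    """Route to the model only while the switch is on; otherwise fall back."""
    if ai_enabled(flag_path):
        return call_model(text)
    return FALLBACK_MESSAGE
```

Failing closed when the flag cannot be read is one design choice among several. It trades availability of the AI path for the guarantee that a broken flag store can never force the feature on.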

Bug Bashes for AI Features: Sampling a Distribution, Not Hunting Defects

May 2, 2026 · 11 min read

Tian Pan

Software Engineer

The classic bug bash is a deterministic ritual built for deterministic software. Ten engineers crowd a Slack channel for two hours, hammer a checklist of golden-path flows, and file tickets with crisp repro steps: "Click X, see Y, expected Z." It works because the system under test is reproducible — same input, same output, same bug, every time.

Run that exact ritual against an AI feature and you will produce two hundred tickets, close one hundred and eighty as "expected stochastic variation," and miss the twenty that signal a real cohort regression. The format isn't just stale; it's actively miscalibrated. A bug bash against an LLM-backed feature is not a defect-hunting session. It is a sampling exercise against a probability distribution, and the team that runs it like a deterministic test session is collecting noise and calling it signal.

This post is about how to redesign the bug bash for stochastic systems — what to change about the format, the participants, the triage rubric, and what counts as "done."
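To make the sampling framing concrete, here is a hedged sketch that treats each cohort's bash results as a sample and reports a Wilson score interval for its failure rate instead of a raw ticket count. The cohort names and counts are invented for illustration, and the Wilson interval is my choice of estimator rather than anything the post prescribes.

```python
import math


def wilson_interval(failures: int, samples: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a failure rate (z=1.96 is roughly 95%)."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    p = failures / samples
    denom = 1 + z**2 / samples
    center = (p + z**2 / (2 * samples)) / denom
    half = z * math.sqrt(p * (1 - p) / samples + z**2 / (4 * samples**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical bash results: (failures, samples) per user cohort.
cohorts = {"new_users": (3, 50), "enterprise": (12, 50)}
for name, (failures, samples) in cohorts.items():
    low, high = wilson_interval(failures, samples)
    print(f"{name}: {failures / samples:.0%} observed, 95% interval {low:.1%} to {high:.1%}")
```

Two cohorts whose intervals barely overlap are a stronger regression signal than any single surprising output, which is the kind of evidence a ticket-per-defect format tends to bury.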
