780 posts tagged with "ai-engineering"

Pipeline Attribution in Compound AI Systems: Finding the Weakest Link Before It Finds You

April 20, 2026 · 10 min read

Software Engineer

Your retrieval precision went up. Your reranker scores improved. Your generator faithfulness metrics look better than last quarter. And yet your users are complaining that the system is getting worse.

This is one of the more disorienting failure modes in production AI engineering, and it happens more often than teams expect. When you build a compound AI system — one where retrieval feeds a reranker, which feeds a generator, which feeds a validator — you inherit a fundamental attribution problem. End-to-end quality is the only metric that actually matters, but it's the hardest one to act on. You can't fix "the system is worse." You need to fix a specific component. And in a four-stage pipeline, that turns out to be genuinely hard.

The Production Distribution Gap: Why Your Internal Testers Can't Find the Bugs Users Do

April 20, 2026 · 11 min read

Tian Pan

Software Engineer

Your AI feature passed internal testing with flying colors. Engineers loved it, product managers gave the thumbs up, and the eval suite showed 94% accuracy on the benchmark suite. Then you shipped it, and within two weeks users were hitting failure modes you'd never seen — wrong answers, confused outputs, edge cases that made the model look embarrassingly bad.

This is the production distribution gap. It's not a new problem, but it's dramatically worse for AI systems than for deterministic software. Understanding why — and having a concrete plan to address it — is the difference between an AI feature that quietly erodes user trust and one that improves with use.

Zero-Shot, Few-Shot, or Chain-of-Thought: A Production Decision Framework

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Ask most engineers why they're using few-shot prompting in production, and you'll hear something like: "It seemed to work better." Ask why they added chain-of-thought, and the answer is usually: "I read it helps with reasoning." These aren't wrong answers, exactly. But they're convention masquerading as engineering. The evidence on when each prompting technique actually outperforms is specific enough that you can make this decision systematically—and the right choice can cut token costs by 60–80% or prevent a degradation you didn't know you were causing.

Here's what the research says, and how to apply it to your stack.

Testing the Retrieval-Generation Seam: The Integration Test Gap in RAG Systems

April 20, 2026 · 11 min read

Tian Pan

Software Engineer

Your retriever returns the right documents 94% of the time. Your LLM correctly answers questions given good context 96% of the time. Ship it. What could go wrong?

Multiply those numbers: 0.94 × 0.96 = 0.90. You've lost 10% of your queries before accounting for any edge cases, prompt formatting issues, token truncation, or the distractor documents your retriever surfaces alongside the correct ones. But the deeper problem isn't the arithmetic — it's that your unit tests will never catch this. The retriever passes its tests in isolation. The generator passes its tests in isolation. The thing that fails is the composition, and most teams have no tests for that.

This is the retrieval-generation seam: the interface between what your retriever hands off and what your generator can actually use. It's the most under-tested boundary in production RAG systems, and it's where most failures originate.

Reasoning Model Economics: When Chain-of-Thought Earns Its Cost

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

A team at a mid-size SaaS company added "let's think step by step" to every prompt after reading a few benchmarks. Their response quality went up measurably — and their LLM bill tripled. When they dug into the logs, they found that most of the extra tokens were being spent on tasks like classifying support tickets and summarizing meeting notes, where the additional reasoning added nothing detectable to output quality.

Extended thinking models are a genuine capability leap for hard problems. They're also a reliable cost trap when applied indiscriminately. The difference between a well-tuned reasoning deployment and an expensive one often comes down to one thing: understanding which tasks actually benefit from chain-of-thought, and which tasks are just paying for elaborate narration of obvious steps.

Shadow to Autopilot: A Readiness Framework for AI Feature Autonomy

April 20, 2026 · 11 min read

Tian Pan

Software Engineer

When a fintech company first deployed an AI transaction approval agent, the product team was convinced the model was ready for autonomy after a week of positive offline evals. They pushed it to co-pilot mode — where the agent suggested approvals and humans could override — and the approval rates looked great. Three weeks later, a pattern surfaced: the model was systematically under-approving transactions from non-English-speaking users in ways that correlated with name patterns, not risk signals. No one had checked segment-level performance before the rollout. The model wasn't a fraud-detection failure. It was a stage-gate failure.

Most teams understand, in principle, that AI features should be rolled out gradually. What they don't have is a concrete engineering framework for what "gradual" actually means: which metrics unlock each stage, what monitoring is required before escalation, and what triggers an automatic rollback. Without these, autonomy escalation becomes an act of organizational optimism rather than a repeatable engineering decision.

The Six-Month Cliff: Why Production AI Systems Degrade Without a Single Code Change

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

Your AI feature shipped green. Latency is fine, error rates are negligible, and the HTTP responses return 200. Six months later, a user complains that the chatbot confidently recommended a product you discontinued three months ago. An engineer digs in and discovers the system has been wrong about a third of what users ask — not because of a bad deploy, not because of a dependency upgrade, but because time passed. You shipped a snapshot into a river.

This isn't a hypothetical. Industry data shows that 91% of production LLMs experience measurable behavioral drift within 90 days of deployment. A customer support chatbot that initially handled 70% of inquiries without escalation can quietly drop to under 50% by month three — while infrastructure dashboards stay green the entire time. The six-month cliff is real, it's silent, and most teams don't have the instrumentation to see it coming.

What 99.9% Uptime Means When Your Model Is Occasionally Wrong

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

A telecom company ships an AI support chatbot with 99.99% availability and sub-200ms response times — every traditional SLA metric is green. It is also wrong on 35% of billing inquiries. No contract clause covers that. No alert fires. The customer just churns.

This is the watermelon effect for AI: systems that look healthy on the outside while quietly rotting inside. Traditional reliability SLAs — uptime, error rate, latency — were built for deterministic systems. They measure whether your service answered, not whether the answer was any good. Shipping an AI feature under a traditional SLA is like guaranteeing that every email your support team sends will be delivered, without any commitment that the replies make sense.

Subgroup Fairness Testing in Production AI: Why Aggregate Accuracy Lies

April 20, 2026 · 11 min read

Tian Pan

Software Engineer

When a face recognition system reports 95% accuracy, your first instinct is to ship it. The instinct is wrong. That same system can simultaneously fail darker-skinned women at a 34% error rate while achieving 0.8% on lighter-skinned men — a 40x disparity, fully hidden inside that reassuring aggregate number.

This is the aggregate accuracy illusion, and it destroys production AI features in industries ranging from hiring to healthcare to speech recognition. The pattern is structurally identical to Simpson's Paradox: a model that looks fair in aggregate can discriminate systematically across every meaningful subgroup simultaneously. Aggregate metrics are weighted averages. When some subgroups are smaller or underrepresented in your eval set, their failure rates get diluted by the majority's success.

The fix is not a different accuracy threshold. It is disaggregated evaluation — computing your performance metrics per subgroup, defining disparity SLOs, and monitoring them continuously in production the same way you monitor latency and error rate.

Synthetic Eval Bootstrapping: How to Build Ground-Truth Datasets When You Have No Labeled Data

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

The common failure mode isn't building AI features that don't work. It's shipping AI features without any way to know whether they work. And the reason teams skip evaluation infrastructure isn't laziness — it's that building evals requires labeled data, and on day one you have none.

This is the cold start problem for evals. To get useful signal, you need your system running in production. To deploy with confidence, you need evaluation infrastructure first. The circular dependency is real, and it causes teams to do one of three things: ship without evals and discover failures in production, delay shipping while hand-labeling data for months, or use synthetic evals — with all the risks that entails.

This post is about the third path done correctly. Synthetic eval bootstrapping works, but only if you understand what it cannot detect and build around those blind spots from the start.

System Prompt Sprawl: When Your AI Instructions Become a Source of Bugs

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams discover the system prompt sprawl problem the hard way. The AI feature launches, users find edge cases, and the fix is always the same: add another instruction. After six months you have a 4,000-token system prompt that nobody can fully hold in their head. The model starts doing things nobody intended — not because it's broken, but because the instructions you wrote contradict each other in subtle ways the model is quietly resolving on your behalf.

Sprawl isn't a catastrophic failure. That's what makes it dangerous. The model doesn't crash or throw an error when your instructions conflict. It makes a choice, usually fluently, usually plausibly, and usually in a way that's wrong just often enough to be a real support burden.

Temperature Governance in Multi-Agent Systems: Why Variance Is a First-Class Budget

April 20, 2026 · 11 min read

Tian Pan

Software Engineer

Most production multi-agent systems apply a single temperature value—copied from a tutorial, set once, never revisited—to every agent in the pipeline. The classifier, the generator, the verifier, and the formatter all run at 0.7 because that's what the README said. This is the equivalent of giving every database query the same timeout regardless of whether it's a point lookup or a full table scan. It feels fine until you start debugging failure modes that look like model errors but are actually sampling policy errors.

Temperature is not a global dial. It's a per-role policy decision, and getting it wrong creates distinct failure signatures depending on which direction you miss in.

About Tian Pan