
164 posts tagged with "production-ai"


Quarterly Model Migration: Make It a Calendar Event, Not a Fire Drill

· 11 min read
Tian Pan
Software Engineer

The deprecation email arrives on a Tuesday afternoon. The model your billing pipeline has depended on for fourteen months is now on a sixty-day timer. The prompt was tuned by an engineer who left in March. The eval suite hasn't been re-baselined since launch. The customer-success team is asking why "the AI feels different" on two enterprise accounts. Nobody put this on the roadmap, and nobody will own it cleanly, because in your org's mental model this is a one-off project — even though it is the fourth one this year.

Every team running an AI feature in production runs into the same realization within eighteen months: the foundation-model provider is operating on a deprecation cadence that the team did not plan for, and the team's migration response keeps being a reactive scramble triggered by a notification email. The fix is not a better playbook for the next migration — there are already plenty of those, and your team has probably written one. The fix is to stop treating migration as a project and start treating it as a recurring operational primitive. Put it on the calendar.

Reasoning-Model Arbitrage: The Slow Expensive Model Is Cheaper on the Hard Prompts

· 10 min read
Tian Pan
Software Engineer

The cheapest line on the pricing page is rarely the cheapest line on the invoice. A team picks the workhorse model — Sonnet, Haiku, Flash, GPT-mini — because the per-token math is friendly, ships a feature, and watches the cost dashboard report a happy unit-economics story for a quarter. Then the long tail catches up: a slice of requests the workhorse can't quite handle starts retrying, then partially answering, then escalating to a human reviewer, and the per-feature P&L stops resembling the per-call dashboard.

The arbitrage is that, on those hard requests, a reasoning model the team would never default to — Opus, o3, the slow expensive one — frequently lands the answer on the first attempt. The all-in cost of one $0.50 reasoning call beats five $0.05 workhorse calls plus the escalation queue and the engineer who debugs the failure on Monday. The procurement question (which model is cheapest per token?) and the architecture question (which model is cheapest per resolved request?) are different questions, and the team that conflates them is paying the difference.
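The per-token versus per-resolved-request distinction can be made concrete with a little arithmetic. The sketch below is illustrative only: the $0.50 and $0.05 call prices and the five workhorse attempts come from the paragraph above, while the resolve rates, the escalation cost, and the function name `all_in_cost_per_request` are assumptions chosen for the example.

```python
def all_in_cost_per_request(cost_per_call, calls_per_request, resolve_rate,
                            escalation_cost):
    """Expected spend to get one hard request resolved.

    Model spend is paid on every request; the escalation cost is paid only
    on the fraction of requests the model fails to resolve on its own.
    """
    model_spend = cost_per_call * calls_per_request
    expected_escalation = (1.0 - resolve_rate) * escalation_cost
    return model_spend + expected_escalation


# Assumed figures for the hard slice of traffic (not measured data):
ESCALATION_COST = 4.00  # hypothetical cost of one human-review escalation

workhorse = all_in_cost_per_request(
    cost_per_call=0.05, calls_per_request=5,
    resolve_rate=0.60, escalation_cost=ESCALATION_COST)
reasoning = all_in_cost_per_request(
    cost_per_call=0.50, calls_per_request=1,
    resolve_rate=0.95, escalation_cost=ESCALATION_COST)

print(f"workhorse: ${workhorse:.2f}  reasoning: ${reasoning:.2f}")
```

Under these assumed rates the workhorse looks cheaper on model spend alone ($0.25 against $0.50) and more expensive once escalations are counted. The conclusion flips if the workhorse's resolve rate on the hard slice is high, which is why the rate has to be measured per slice rather than assumed.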

Why AI Quality Monitors Conflate Model Drift, Data Drift, and Prompt Drift — and What to Do About Each

· 10 min read
Tian Pan
Software Engineer

A fraud detection model's accuracy silently halved over three weeks. Latency was normal, error rates were zero, and every infrastructure dashboard was green. Engineers spent the first week auditing the data pipeline, the second week comparing model weights, and the third week reopening tickets before someone noticed that fraudsters had simply changed their language patterns. The fix — retraining on recent examples — took two days. The misdiagnosis took three weeks.

This pattern repeats across production AI teams: degradation sets off a generalized "model problem" alarm, and the team starts pulling levers based on intuition rather than root cause. The reason isn't a lack of monitoring discipline; it's that most observability stacks treat three structurally distinct problems as one. Model drift, data drift, and prompt drift have different detection signatures, different alert topologies, and different remediation paths. Conflating them is how weeks get wasted on the wrong fix.

AI Output Volatility Is a Business Risk You're Probably Underpricing

· 9 min read
Tian Pan
Software Engineer

When companies talk about AI risk, the conversation usually gravitates toward the obvious failures: hallucinated facts, biased outputs, legal liability from generated content. What gets far less attention is a quieter structural problem: you've made commercial commitments — pricing tiers, SLAs, customer-facing accuracy claims — on top of a system whose outputs are inherently probabilistic. Every time the model generates a response, it's sampling from a distribution. The contract doesn't mention distributions.

This is a business risk that most teams discover late, when a customer complains that the same document review workflow gave completely different results on Monday and Friday. Or when a regulator asks for reproducibility guarantees that the system architecturally cannot provide.

The Domain Expert Bottleneck in RAG: Why Knowledge Curation Breaks Production AI

· 7 min read
Tian Pan
Software Engineer

Most teams building RAG systems spend their first month on the pipeline — chunking strategy, embedding model selection, vector store configuration, retrieval tuning. They get that working. The demo passes. Stakeholders are impressed.

Then six months later, the system starts quietly degrading. Support tickets reference wrong procedures. The bot cites a pricing tier that was retired in Q3. A customer gets a confident answer about a product feature that was deprecated before they even signed up. The pipeline is fine. The knowledge base is the problem.

Embedding Model Churn: When Your Provider Silently Invalidates Your Entire Vector Index

· 9 min read
Tian Pan
Software Engineer

You spent weeks building a retrieval pipeline. Chunking strategy tuned, similarity thresholds calibrated, user feedback looking positive. Then one Monday morning, without any deployment on your end, retrieval quality starts degrading. Queries that used to surface the right documents now return loosely related noise. No error logs. No exceptions. The pipeline runs clean.

What changed: your embedding provider updated its model. Your entire vector index (millions of documents, painstakingly embedded) is now populated with vectors from a coordinate system that no longer matches what your query encoder produces. The result is not a crash. It's invisible garbage.
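One way to turn this silent failure into a loud one is a canary check: keep the vectors for a handful of fixed texts from index-build time, re-embed the same texts on a schedule, and alert when the two sets stop agreeing. The sketch below is a minimal illustration under assumptions. The function names and the 0.99 threshold are hypothetical, and the call to the embedding provider is left out, so the caller supplies both sets of vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def embedding_space_changed(stored, fresh, min_similarity=0.99):
    """Compare canary vectors saved at index time with freshly computed ones.

    stored, fresh: dicts mapping the same canary text to its vector.
    Returns True if any canary changed dimension or drifted below the
    similarity threshold, which suggests the index needs re-embedding.
    """
    for text, old_vec in stored.items():
        new_vec = fresh[text]
        if len(old_vec) != len(new_vec):
            return True
        if cosine(old_vec, new_vec) < min_similarity:
            return True
    return False
```

A deterministic model should reproduce the stored canary vectors almost exactly, so the threshold can sit close to 1.0. Pinning an explicit model version, where the provider offers one, covers the same risk from the other side.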

Explanation Debt: Why Users Deserve to Know What Your AI Did

· 8 min read
Tian Pan
Software Engineer

A loan application gets rejected. A candidate gets filtered out of a hiring pipeline. A medical imaging tool flags a scan as abnormal. In each case, an AI system made a decision that matters, and the user has no idea why.

Teams building these systems often spend months tuning precision, recall, and output quality. They run A/B tests, iterate on prompts, and ship a model that gets the right answer 94% of the time. But they never build the layer that tells users what happened. This is explanation debt: the accumulated cost of shipping AI decisions without the attribution, confidence signals, and recourse affordances that make those decisions interpretable.

Human Override as a First-Class Feature: Designing AI Systems That Fail Gracefully to Human Control

· 10 min read
Tian Pan
Software Engineer

When an AI-powered customer support agent can't resolve an issue and escalates to a human, what happens next? In most systems, the customer is transferred cold, with no context, and must re-explain everything from the beginning. The human agent has no idea what the AI attempted, what information was collected, or why the handoff occurred.

This is the most common form of human override failure — not a dramatic AI meltdown, but a quiet UX collapse at the seam between automated and human handling. It happens because engineers built the AI path carefully and treated human takeover as an afterthought, a fallback for when things go wrong. The result is that override feels like a system error rather than a designed operational mode.

The engineering teams that get this right treat human override as a first-class feature from day one. Here's what that looks like in practice.

The Ghost in the Weights: How Pretraining Residue Breaks Your Fine-Tuned Model in Production

· 10 min read
Tian Pan
Software Engineer

Your fine-tuned model passes your eval suite with 93% accuracy. You ship it. Three weeks later, a customer sends a screenshot: the model answered a question it had never seen in training with complete confidence — and it was completely wrong. The answer wasn't a hallucination in the usual sense. It was a memory. A pattern baked in during pretraining, resurfacing on a distribution the fine-tune never covered. This is pretraining residue, and it's one of the most underdiagnosed failure modes in production fine-tuning.

Fine-tuning adjusts weights. It does not retrain the model from scratch. The patterns — the calibration mechanisms, the confidence signals, the world-model priors — developed during pretraining at trillion-token scale remain in the weights. Your fine-tuning dataset, no matter how carefully curated, is a thin layer on top of a much deeper prior. When inputs arrive that fall outside your fine-tuning distribution, the model doesn't say "I don't know." It reaches back to pretraining and answers as if it does.

Conflicting Instructions in System Prompts: The Silent Failure Mode No One Owns

· 10 min read
Tian Pan
Software Engineer

Your AI feature worked great at launch. Six months later it sometimes gives terse one-liners, sometimes writes five-paragraph essays, and occasionally refuses to answer questions it handled without complaint last quarter. Nothing in the codebase changed — or so you think. The system prompt changed, incrementally, through eleven pull requests authored by four engineers across two teams. Each change was individually sensible. Collectively, they turned your prompt into a contradiction machine.

This is the instruction contradiction problem. It does not throw an exception. It does not appear in error logs. It manifests as behavioral drift — the model doing subtly different things in subtly different situations in ways that are hard to reproduce and harder to attribute. By the time a user files a bug, the prompt has already been patched twice more.

The Knowledge Half-Life Problem: Why Your RAG System Is Already Wrong

· 9 min read
Tian Pan
Software Engineer

Your RAG system passed all the retrieval benchmarks. Precision looks solid. The LLM-as-judge eval scores are green. And yet, somewhere in your index, there is a document describing an API endpoint that was deprecated eight months ago, a pricing tier that no longer exists, and a compliance policy that was superseded by new regulations in Q3. Your retriever has no idea. Semantic similarity has no concept of time.

This is the knowledge half-life problem: the silent failure mode where RAG systems appear healthy on every metric you're measuring while serving increasingly stale answers to users. Seventy-three percent of organizations report accuracy degradation in RAG deployments within 90 days. The cause is not poor retrieval architecture or embedding model quality, but knowledge staleness that no one modeled as a reliability concern.
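Because similarity scores carry no notion of time, one possible mitigation is to attach a last-verified date to every document and discount retrieval scores by age. The sketch below takes the half-life framing literally and is an assumption-laden illustration: the `last_verified` field, the 180-day half-life, and the function names are all hypothetical, and a real system would likely tune the half-life per content type.

```python
from datetime import datetime, timedelta, timezone


def freshness_weight(last_verified, now, half_life_days):
    """Exponential decay: the weight halves every half_life_days."""
    age_days = max((now - last_verified).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def rerank_by_freshness(hits, now, half_life_days=180.0):
    """Reorder retrieval hits by similarity discounted for document age.

    hits: list of dicts with "doc_id", "similarity", and "last_verified".
    Returns doc_ids sorted from best to worst adjusted score.
    """
    scored = [
        (h["similarity"] * freshness_weight(h["last_verified"], now,
                                            half_life_days), h["doc_id"])
        for h in hits
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
hits = [
    {"doc_id": "old-pricing", "similarity": 0.90,
     "last_verified": now - timedelta(days=360)},
    {"doc_id": "current-pricing", "similarity": 0.80,
     "last_verified": now},
]
ranking = rerank_by_freshness(hits, now)
```

With these example inputs the older document wins on raw similarity but loses after the age discount. Decay only demotes stale content; it does not replace an owner who re-verifies or retires documents.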

AI Pipeline Exception Handling: Hallucinations, Refusals, and Format Violations Are First-Class Errors

· 10 min read
Tian Pan
Software Engineer

Your AI pipeline reported zero errors last night. The output was completely wrong.

That's not a hypothetical. A recent industry report found that roughly 1 in 20 production LLM requests fail in ways that never surface as exceptions — valid HTTP 200, well-formed JSON, fluent prose, factually wrong. The observability stack stays green while the pipeline quietly lies to its users.

The root cause is an architectural assumption borrowed from traditional service engineering: that HTTP status codes and parse errors cover the failure space. They don't. LLM pipelines have at least four failure types that the underlying infrastructure cannot see — hallucinations, refusals, format violations, and context overflow — and treating them as edge cases instead of first-class error types is how production AI systems ship invisible bugs at scale.
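Treating these failures as first-class errors can start with giving each one its own exception type, so a bad output is raised and counted like any other fault. The sketch below is one possible shape, not a definitive implementation. The class names, the refusal phrases, and the `validate_response` signature are all hypothetical, and the hallucination check is delegated to a caller-supplied `is_grounded` function because no generic detector is assumed here.

```python
import json


class LLMOutputError(Exception):
    """Base class for failures that still arrive as HTTP 200."""


class RefusalError(LLMOutputError):
    """The model declined to answer."""


class FormatViolationError(LLMOutputError):
    """The output does not match the expected structure."""


class ContextOverflowError(LLMOutputError):
    """The prompt exceeded the context budget."""


class HallucinationError(LLMOutputError):
    """The output failed the caller's grounding check."""


# Illustrative phrases only; real refusals vary by model and prompt.
REFUSAL_MARKERS = ("i can't help with", "i cannot help with", "i'm unable to")


def validate_response(raw_text, required_keys, prompt_tokens, context_limit,
                      is_grounded=None):
    """Return the parsed payload, or raise a typed error describing the failure."""
    if prompt_tokens > context_limit:
        raise ContextOverflowError(
            f"prompt used {prompt_tokens} tokens, limit is {context_limit}")
    lowered = raw_text.strip().lower()
    if any(lowered.startswith(marker) for marker in REFUSAL_MARKERS):
        raise RefusalError(raw_text.strip()[:80])
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise FormatViolationError("output is not valid JSON") from exc
    if not isinstance(payload, dict):
        raise FormatViolationError("expected a JSON object")
    missing = [key for key in required_keys if key not in payload]
    if missing:
        raise FormatViolationError(f"missing keys: {missing}")
    if is_grounded is not None and not is_grounded(payload):
        raise HallucinationError("payload failed the grounding check")
    return payload
```

Because every failure type inherits from `LLMOutputError`, a pipeline can catch the base class for retries and still emit a separate metric per subtype.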