109 posts tagged with "mlops"

Quantization Slippage: The Capability Tax Your Eval Set Was Never Built to Catch

· 11 min read
Tian Pan
Software Engineer

A self-hosted LLM team quantizes the production model from fp16 to int4. Memory drops 4×, throughput nearly doubles, the GPU bill shrinks, and the team reruns the same eval suite that gated the fp16 release. MMLU-Pro retains 98.1% of baseline. Aggregate quality looks fine. They ship.

Six weeks later, a support engineer notices the math tutoring feature has gotten quietly worse. The compliance team flags an uptick in policy-violation completions on adversarial prompts. The structured-output retry rate has crept from 1.4% to 6.8%. None of these show up on the eval dashboard, because the dashboard was built to validate a different model: the one that came from the same checkpoint but had four times as many bits behind every weight.

This is quantization slippage. The cost analysis priced the memory win and the latency win. It did not price the eval re-anchoring that the swap silently demanded, and the eval suite, calibrated against the fp16 distribution, is now grading the wrong model with the wrong rubric.
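One way to make the slippage visible is to compare the two models slice by slice instead of trusting a single aggregate. The sketch below is a minimal illustration, not a real eval harness: the slice names, the scores, and the 0.97 retention floor are all hypothetical.

```python
from typing import Dict, List


def retention_by_slice(
    baseline: Dict[str, float], quantized: Dict[str, float]
) -> Dict[str, float]:
    """Quantized score as a fraction of the baseline score, per eval slice."""
    return {
        name: quantized[name] / baseline[name]
        for name in baseline
        if baseline[name] > 0
    }


def slipped_slices(
    baseline: Dict[str, float],
    quantized: Dict[str, float],
    min_retention: float = 0.97,  # hypothetical floor, tune per product
) -> List[str]:
    """Names of slices whose retention fell below the floor, sorted."""
    retention = retention_by_slice(baseline, quantized)
    return sorted(name for name, r in retention.items() if r < min_retention)


# Hypothetical per-slice scores for the fp16 baseline and the int4 build.
fp16_scores = {
    "general_knowledge": 0.80,
    "math": 0.70,
    "safety_refusal": 0.95,
    "json_validity": 0.986,
}
int4_scores = {
    "general_knowledge": 0.79,
    "math": 0.60,
    "safety_refusal": 0.90,
    "json_validity": 0.932,
}

print(slipped_slices(fp16_scores, int4_scores))
# → ['json_validity', 'math', 'safety_refusal']
```

In this made-up data the general-knowledge slice retains almost 99% of baseline, which is roughly what a headline benchmark would report, while the three slices that map to the incidents above all fall below the floor.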

Your Fine-Tuning Corpus Is a Codebase. Stop Shipping It Through a Bucket.

· 11 min read
Tian Pan
Software Engineer

By month nine of any serious fine-tuning project, your training corpus has more authors than your codebase. Synthetic generation pipelines wrote a few million examples. The vendor labeling firm contributed 80K rows from a workforce you have never met. An engineer added 47 examples last Tuesday to fix a regression they spotted in eval. A scraping job pulls production traces into a "supplementary" parquet file every night. A CSV someone dropped into S3 in February is still there, still in the training mix, and the person who wrote it left the company in March.

Now look at your application code repo. Every line is attributable to a named author. Every change went through a PR with at least one reviewer. Commits are signed. The main branch is protected. Merges require a second human. There is an audit log. If an auditor asks who wrote line 47 of payment_processor.py, you have an answer within seconds.

If they ask who wrote example 47 of the corpus that produced model v2.3, the honest answer is "a Mechanical Turk batch from 2024-Q2, vendor unknown, justification absent." Your fine-tuning corpus is a higher-privilege deployment surface than your codebase — it directly shapes model behavior in production — and you are shipping it through a bucket while you ship code through a reviewed PR. The threat model is inverted.

Production Bias Auditing: Catching AI Discrimination Before Your Users Do

· 11 min read
Tian Pan
Software Engineer

The most expensive bias bug I've seen in production was discovered by a Twitter thread, not a dashboard. A small team had shipped a credit-scoring assistant. They'd run the standard pre-launch audit: balanced training set, adversarial debiasing, equalized-odds gap under five percent on the holdout. A month after launch, a user posted screenshots showing women in their household consistently received lower limits than men with identical financials. By the time the team's monitoring caught up, the regulator had already opened an inquiry.

The lesson isn't that the team was lazy. They ran exactly the audit the literature recommends. The lesson is that pre-launch audits measure a snapshot of a model that no longer exists by the time real users hit it. Distribution shifts. New populations show up. A prompt-template change introduces a phrasing artifact that interacts with names. A model upgrade quietly trades calibration for a fluency win. The audit you ran in November does not protect the model running in production in May.

Fine-Tune Orphan: Recovering Domain Expertise When the Base Model Is Deprecated

· 9 min read
Tian Pan
Software Engineer

On January 4, 2024, OpenAI retired the /fine-tunes endpoint. Every fine-tuned Ada, Babbage, Curie, and Davinci model stopped responding. Teams that had spent months building production systems on these models — careful prompt design, annotated datasets, labeling pipelines — woke up to HTTP 404s. The fine-tunes didn't migrate. The learned behaviors didn't transfer. The domain expertise was gone.

This wasn't a fringe edge case. Google followed in August 2024 by completely decommissioning the PaLM API, with zero backwards-compatible grace period. Unlike OpenAI, which at least let existing GPT-3.5 fine-tunes keep running while blocking new training runs, Google's shutdown meant production inference stopped the same day. If your fine-tuned PaLM model was in the critical path, you had a service outage.

Why AI Quality Monitors Conflate Model Drift, Data Drift, and Prompt Drift — and What to Do About Each

· 10 min read
Tian Pan
Software Engineer

A fraud detection model's accuracy silently halved over three weeks. Latency was normal, error rates were zero, and every infrastructure dashboard was green. Engineers spent the first week auditing the data pipeline, the second week comparing model weights, and the third week reopening tickets before someone noticed that fraudsters had simply changed their language patterns. The fix — retraining on recent examples — took two days. The misdiagnosis took three weeks.

This pattern repeats across production AI teams: degradation sets off a generalized "model problem" alarm, and the team starts pulling levers based on intuition rather than root cause. The reason isn't a lack of monitoring discipline; it's that most observability stacks treat three structurally distinct problems as one. Model drift, data drift, and prompt drift have different detection signatures, different alert topologies, and different remediation paths. Conflating them is how weeks get wasted on the wrong fix.

Why Rolling Back an AI Feature Is Harder Than Rolling Back Code

· 9 min read
Tian Pan
Software Engineer

When a personality update made a popular AI assistant noticeably more flattering and complimentary, the engineering team quickly identified the problem and issued a rollback within days. The code change was clean. The model swap was straightforward. And users were furious anyway — not because the rollback was broken, but because some of them had already built workflows around the sycophantic version. Their prompt strategies, their review loops, their interpretation of the model's confidence signals — all of it had been tuned to an AI they no longer had access to.

Rolling back the code had taken hours. Rolling back the users was impossible.

This asymmetry is the central challenge of AI feature management that most engineering teams underestimate until they've been burned by it. Conventional rollback thinking treats "undo" as a purely technical operation. For AI features, that's only half the story.

The AI Incident Postmortem Nobody Writes: A Four-Layer Diagnosis Framework

· 11 min read
Tian Pan
Software Engineer

When a recommendation engine surfaced offensive content last quarter, the post-incident review produced a familiar outcome: a two-hour call where ML engineers pointed at the retrieval corpus, data engineers pointed at the prompt, product engineers pointed at monitoring, and infrastructure pointed at the model version that nobody remembered upgrading. Three action items were created. None had owners. The incident closed. The same failure mode shipped again six weeks later.

This is not a story about one team. It is the default ending for AI incidents at most organizations. Responsibility for what an AI feature does in production is distributed across enough parties that a standard postmortem cannot pin causation. The 5-why analysis that works well for database timeouts breaks when the failure is "the model gave the wrong answer" — because the correct next question is never obvious.

Embedding Model Churn: When Your Provider Silently Invalidates Your Entire Vector Index

· 9 min read
Tian Pan
Software Engineer

You spent weeks building a retrieval pipeline. Chunking strategy tuned, similarity thresholds calibrated, user feedback looking positive. Then one Monday morning, without any deployment on your end, retrieval quality starts degrading. Queries that used to surface the right documents now return loosely related noise. No error logs. No exceptions. The pipeline runs clean.

What changed: your embedding provider updated its model. Your entire vector index, millions of documents painstakingly embedded, is now populated with vectors from a coordinate system that no longer matches what your query encoder produces. The result is not a crash. It's invisible garbage.
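A cheap guard against this is a canary check: store the vectors for a handful of fixed strings at index time, re-embed the same strings on a schedule, and alert when they stop matching. The sketch below assumes an `embed` callable that wraps whatever provider you use; the two toy embedders and the 0.999 threshold are stand-ins for illustration, not any real provider's behavior.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drifted_canaries(
    canaries: Dict[str, Sequence[float]],
    embed: Callable[[str], Sequence[float]],
    min_similarity: float = 0.999,  # hypothetical threshold
) -> List[str]:
    """Canary texts whose fresh embedding no longer matches the stored one."""
    drifted = []
    for text, stored in canaries.items():
        fresh = embed(text)
        if len(fresh) != len(stored) or cosine(stored, fresh) < min_similarity:
            drifted.append(text)
    return drifted


# Toy stand-ins for "the model before" and "the model after" a silent update.
def old_embed(text: str) -> List[float]:
    return [float(len(text)), float(text.count("a")), 1.0]


def new_embed(text: str) -> List[float]:
    # Same features, permuted axes: a different coordinate system.
    return [1.0, float(len(text)), float(text.count("a"))]


canaries = {t: old_embed(t) for t in ["refund policy", "reset password"]}

print(drifted_canaries(canaries, old_embed))  # → []
print(drifted_canaries(canaries, new_embed))
# → ['refund policy', 'reset password']
```

The check costs a few embedding calls per run and fails loudly in exactly the case where the pipeline itself stays silent.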

Enterprise AI's Last Mile Problem: Why Most Pilots Never Reach Production

· 8 min read
Tian Pan
Software Engineer

A model that scores 94% on your internal benchmark, impresses stakeholders in a demo, and passes every offline evaluation can still reach production and drop to 7% effective accuracy on real customer data. This isn't a hypothetical. It's a documented outcome from multiple enterprise AI deployments, and it's one symptom of a broader pattern: the gap between "pilot success" and "production value" is where most enterprise AI quietly dies.

Across industries, roughly 85–88% of enterprise AI pilots never reach production. For every 33 PoCs an organization starts, only four ship. That ratio has barely moved in three years despite massive increases in model capability. The failure mode has nothing to do with whether the model is good enough — it's almost always about what happens between the successful demo and the moment a real user relies on the system to do real work.

The RAG Eval Invalidation Paradox: Why Updating Your Knowledge Base Breaks Your Benchmarks

· 10 min read
Tian Pan
Software Engineer

Your RAG eval suite passes at 0.89 faithfulness. You add 5,000 new support documents to the knowledge base. You re-run the same evals. Faithfulness drops to 0.79. Your team files a model regression ticket.

Nothing regressed. Your eval just became a lie.

This is the RAG eval invalidation paradox: the moment you update your knowledge base, the evaluation set you built against the old index silently stops measuring what it was designed to measure. Most teams discover this months later — after burning engineering cycles on phantom regressions — if they ever discover it at all.

The Retrograde Accuracy Problem: Why AI Features Degrade as Your Product Grows

· 10 min read
Tian Pan
Software Engineer

Your AI feature ships clean. Accuracy on the eval set: 91%. Latency: acceptable. The team is proud. Six months later, users are complaining that the feature feels "dumb," support tickets are climbing, and your aggregate metrics are quietly 8% worse than launch day. Nobody changed the model. The underlying data pipeline is intact. What happened?

This is the retrograde accuracy problem. As your product grows — new features, new user segments, new edge cases, new flows — the input distribution your AI sees in production quietly drifts away from the distribution it was trained on. No model update. No data pipeline failure. The product itself outgrew the model.
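The drift itself is measurable if you log some categorical feature of each request, such as which product flow it came from. As a rough sketch, the population stability index (PSI) compares the launch-time mix with the current one. The flow names and counts here are hypothetical, and the `eps` floor for unseen categories is an assumption of this example rather than part of any standard definition.

```python
import math
from collections import Counter
from typing import Iterable


def psi(reference: Iterable[str], production: Iterable[str], eps: float = 1e-4) -> float:
    """Population stability index between two categorical samples.

    Shares are floored at eps so that a category that is new in
    production (share 0 in the reference) does not divide by zero.
    """
    ref_counts = Counter(reference)
    prod_counts = Counter(production)
    total_ref = sum(ref_counts.values())
    total_prod = sum(prod_counts.values())
    score = 0.0
    for category in set(ref_counts) | set(prod_counts):
        p = max(ref_counts[category] / total_ref, eps)
        q = max(prod_counts[category] / total_prod, eps)
        score += (q - p) * math.log(q / p)
    return score


# Hypothetical request mix at launch and six months later.
launch_traffic = ["search"] * 80 + ["checkout"] * 20
current_traffic = ["search"] * 40 + ["checkout"] * 20 + ["returns"] * 40

print(psi(launch_traffic, launch_traffic))   # → 0.0
print(psi(launch_traffic, current_traffic) > 0.25)  # → True
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift. Here the new "returns" flow, which the model never saw at launch, pushes the score far past that.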

Scheduling Fairness in Multi-Tenant LLM Inference: Why FIFO Is the Wrong Default

· 11 min read
Tian Pan
Software Engineer

Your company runs a shared LLM serving cluster. Two tenants use it: a customer-facing chatbot with a 500ms first-token latency SLO, and a batch document enrichment pipeline that processes thousands of long-context prompts overnight. One morning, the chatbot team pages you at 3am because their P95 TTFT spiked to 12 seconds. Root cause: the batch job started earlier than expected, filled the GPU memory with prefill work, and the chatbot's short requests sat in queue behind a parade of 8,000-token prompts. Your FIFO scheduler gave them equal priority. The chatbot's SLO was violated 4,000 times before you killed the batch job manually.

This failure mode is well understood in theory and still surprisingly widespread in practice. Most teams deploy vLLM or TGI with the default FIFO scheduler, add multiple workloads over time, and only discover the priority inversion when an incident happens.
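The head-of-line blocking is easy to reproduce in a toy model. The sketch below simulates a single non-preemptive server and compares worst-case queueing delay under FIFO and under a simple two-level priority rule. It is an illustration only: it does not model continuous batching or any real vLLM or TGI scheduler, and the `Request` type, tenant names, and timings are hypothetical.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Request:
    tenant: str
    arrival: float   # seconds
    service: float   # seconds of exclusive server time
    priority: int    # lower value is served first when priorities are on


def max_wait(requests: List[Request], use_priority: bool) -> Dict[str, float]:
    """Worst queueing delay per tenant on one non-preemptive server."""
    pending = sorted(requests, key=lambda r: r.arrival)
    queue: list = []
    worst: Dict[str, float] = {}
    clock = 0.0
    i = 0
    while i < len(pending) or queue:
        # Idle server: jump ahead to the next arrival.
        if not queue and pending[i].arrival > clock:
            clock = pending[i].arrival
        # Admit everything that has arrived by now.
        while i < len(pending) and pending[i].arrival <= clock:
            r = pending[i]
            level = r.priority if use_priority else 0
            # The index i breaks ties, so Request objects are never compared.
            heapq.heappush(queue, (level, r.arrival, i, r))
            i += 1
        _, _, _, r = heapq.heappop(queue)
        worst[r.tenant] = max(worst.get(r.tenant, 0.0), clock - r.arrival)
        clock += r.service
    return worst


# Five long batch prefills land at t=0; one short chat request at t=0.5.
workload = [Request("batch", 0.0, 2.0, priority=1) for _ in range(5)]
workload.append(Request("chat", 0.5, 0.1, priority=0))

fifo = max_wait(workload, use_priority=False)
prio = max_wait(workload, use_priority=True)
print(fifo["chat"])  # → 9.5
print(prio["chat"])  # → 1.5
```

Under FIFO the chat request waits behind all five batch jobs. With priorities it waits only for the job already running, and the worst batch delay grows by just the 0.1 seconds the chat request occupies the server.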