
129 posts tagged with "mlops"


The Canary Cohort Your Rollout Hashed by ID That Clustered Power Users Into One Arm

· 10 min read
Tian Pan
Software Engineer

A rollout team ships a new model behind a percentage flag. The flag bucket is computed as hash(user_id) % 100, the canary is buckets 0–4, the lift on per-user engagement is large and stable for two weeks, and the team ramps to 20%, then 50%, then global. The lift evaporates somewhere between 50% and global, and the post-mortem traces it back to the canary cohort. The treatment didn't move the metric. The canary arm was a different population.

The team thought it had been sampling users. It had been sampling IDs.
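One way to make the distinction concrete, as a rough sketch rather than the team's actual fix: if the bucketing hash preserves structure in the ID (in Python, for example, `hash()` of a small non-negative integer is the integer itself, so `hash(user_id) % 100` reduces to the last two digits of a numeric ID), then anything correlated with how IDs were issued is correlated with the bucket. The snippet below assumes string user IDs and swaps in a salted SHA-256. The function names, the salt format, and the 5% canary width are illustrative assumptions, not details from the incident.

```python
import hashlib


def bucket(user_id: str, salt: str, buckets: int = 100) -> int:
    """Map a user to a bucket in [0, buckets) using a salted SHA-256 digest."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer; modulo bias is negligible at this size.
    return int.from_bytes(digest[:8], "big") % buckets


def in_canary(user_id: str, salt: str, percent: int = 5) -> bool:
    """True when the user falls in the lowest `percent` buckets for this rollout."""
    return bucket(user_id, salt) < percent


def arm_means(pre_period_metric: dict[str, float], salt: str, percent: int = 5) -> tuple[float, float]:
    """Mean of a pre-rollout metric in (canary, control).

    A large gap before any treatment is applied suggests the arms are different populations.
    """
    canary = [v for u, v in pre_period_metric.items() if in_canary(u, salt, percent)]
    control = [v for u, v in pre_period_metric.items() if not in_canary(u, salt, percent)]
    if not canary or not control:
        raise ValueError("both arms need at least one user")
    return sum(canary) / len(canary), sum(control) / len(control)
```

A per-rollout salt keeps the same users from landing in every canary, and comparing a pre-period metric across arms is a cheap check to run before trusting any lift.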

The Eval Harness That Ran on Yesterday's Prompt Template After Your Team Shipped a New One

· 9 min read
Tian Pan
Software Engineer

The incident timeline reads cleanly. At 9:02 your platform team pushed prompt-template@v38 to the config service. At 11:14 your dashboards showed everything green. At 16:51 someone in support flagged a spike in escalations. At 17:03 you opened the eval suite, found a regression score of 0.34, and rolled back. The post-mortem says "caught in eight hours, no customer harm beyond the 0.04% who saw it." Engineering leadership applauds the response time.

It is wrong. The regression was caught in zero hours. The eval suite running at 17:03 was the same eval suite running at 09:03. It had been pointed at v37 the entire time. The harness loaded the template from your config service at process startup, cached the rendered prompts as Python objects in module-level scope, and never reread the source. Your live traffic moved to v38 at 9am. Your eval moved at 17:03, when someone restarted the worker pool to "rerun the regression." Eight hours of customer interactions ran against a prompt that no eval had ever scored, while the eval kept grading a prompt that no production request was using.
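A minimal sketch of the alternative, assuming the config service can be wrapped in a callable that returns the current template together with its version: the harness re-reads the template on every run and stamps the version onto the result, so a score can be refused when it does not match what production is serving. `Template`, `EvalHarness`, and `check_same_version` are hypothetical names for illustration, not the harness from the incident.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Template:
    version: str
    text: str


class EvalHarness:
    """Fetches the template on every run instead of caching it at process startup."""

    def __init__(self, fetch_template: Callable[[], Template]):
        self._fetch_template = fetch_template

    def run(self, cases: Iterable[dict], score: Callable[[str], float]) -> dict:
        template = self._fetch_template()  # re-read the source for each run
        scores = [score(template.text.format(**case)) for case in cases]
        if not scores:
            raise ValueError("no eval cases supplied")
        return {
            "template_version": template.version,
            "mean_score": sum(scores) / len(scores),
        }


def check_same_version(result: dict, live_version: str) -> None:
    """Refuse to report a score for a template that production is not serving."""
    if result["template_version"] != live_version:
        raise RuntimeError(
            f"eval scored {result['template_version']} but production serves {live_version}"
        )
```

Recording the version next to every score also makes the dashboard self-describing: a green tile for v37 is visibly not a green tile for v38.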

The Fine-Tune Artifact Your Departing Engineer Took With Them

· 12 min read
Tian Pan
Software Engineer

A fine-tune is not a file. It is the closure of a pipeline over a training set, and the team that ships the file without the closure has built a production dependency whose source code is in someone else's head. The day that person leaves with two weeks of notice and a clean handoff document is the day your bus factor on a revenue feature drops to zero and nobody notices, because the weights are still in the registry and the registry tag is still stable and the model still serves traffic. The reckoning shows up later, in a routine base-model migration that should have taken a sprint and takes a quarter instead.

The pattern is consistent across teams I have watched run into it. An ML engineer spends six months iterating on a fine-tune — data curation, hyperparameter sweeps, behavioral patches evaluated by feel against a held-out set. The final adapter weights get pushed to the model registry with a tag. The training pipeline that produced those weights is a notebook on the engineer's laptop, with hard-coded paths and floating dependencies that resolved to whatever was the latest version on the day each cell was last executed. The team accepts the handoff at face value because the weights work and the eval scores are good and the registry tag is stable. Eighteen months later, the engineer departs. Six months after that, a base-model migration requires regenerating the adapter against an updated base, the notebook runs and produces weights that score three points lower and regress visibly on the hardest customer segment, and the team spends four months trying and failing to reproduce the original artifact.

The Model Registry Your Platform Team Built That Nobody Updated

· 12 min read
Tian Pan
Software Engineer

A platform team I know spent two quarters building a model registry. It had everything the org chart asked for: a promotion workflow from dev to staging to prod, a CODEOWNERS-style approval matrix, lineage tracking, eval-score gates, a deprecation policy with a 30-day window, and a Backstage tile that showed which version of every model was live in which service. They cut a launch announcement, ran a brown bag, and added a row to the compliance binder.

Six months later, the highest-traffic agent in the company was running on a model card whose "owner" field still pointed at someone who had left, whose eval score was from a benchmark the team had since deprecated, and whose "approved by" name was the platform tech lead — who had never used that agent, never read its eval set, and had pressed approve at 11:43pm on a Thursday because the producer had pinged him in DMs saying the launch was tomorrow.

The registry was not broken. The promotion gates fired. The audit log was intact. Everything the launch announcement had promised was true. And the org had less real oversight of its production models than it had had eighteen months earlier, when the same decisions were made by an ML engineer reading the eval output by hand before pasting the model URI into a config file.

The RAG Threshold Pinned to an Absolute Score the Embedding Upgrade Silently Moved

· 9 min read
Tian Pan
Software Engineer

A RAG pipeline ships with a reranker score threshold of 0.4. Anything below gets dropped from the prompt. Six months in, a routine index rebuild swaps the embedding model for a newer checkpoint in the same family — a transparent upgrade, the change log says. Two days later answer relevance falls 6%. The team blames the LLM, runs a model bake-off, finds no candidate that recovers the loss, and spends a quarter chasing a regression that lives in none of the models they were comparing.

The regression lives in the gate. The reranker — untouched, same checkpoint, same weights — is now scoring a different candidate set. The new embeddings pull different chunks into the top-50, the reranker scores them lower on its own calibration, and the gate at 0.4 drops 37% more candidates than it did the week before. The number 0.4 didn't change. What 0.4 meant changed.
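To illustrate how the same number can mean different things, here is a small sketch that measures the gate's drop rate on reranker scores collected before and after the index rebuild, then picks a new cutoff that drops the same fraction of candidates. Matching the drop rate is an assumption made for this example, not a guarantee of equal relevance, and the function names are hypothetical.

```python
def drop_rate(scores: list[float], threshold: float) -> float:
    """Fraction of candidates the gate would drop (score strictly below threshold)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(s < threshold for s in scores) / len(scores)


def recalibrate(old_scores: list[float], new_scores: list[float], old_threshold: float) -> float:
    """Pick a threshold on the new score distribution that drops the same fraction
    of candidates as old_threshold did on the old one.

    With tied scores at the cutoff, slightly fewer candidates may be dropped.
    """
    target = drop_rate(old_scores, old_threshold)
    ranked = sorted(new_scores)
    if not ranked:
        raise ValueError("new_scores must not be empty")
    k = round(target * len(ranked))  # number of candidates to drop
    if k >= len(ranked):
        return float("inf")  # the old gate dropped everything
    return ranked[k]


# Illustrative scores only, not measurements from a real pipeline.
before = [0.1, 0.3, 0.5, 0.7, 0.9]
after = [0.05, 0.2, 0.35, 0.5, 0.8]
new_cutoff = recalibrate(before, after, 0.4)
```

Even without recalibrating, tracking `drop_rate` per index build would surface a shift like this on the day of the rebuild rather than a quarter later.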

The Thumbs-Up Button That Poisoned Your Eval Set Through the Back Door

· 11 min read
Tian Pan
Software Engineer

A thumbs-up button is the cheapest signal you will ever instrument. It is also one of the most dangerous, because nothing about it announces that it is reshaping the distribution your eval set is supposed to represent. The button is collected as a positive — the curation pipeline reads it as quality — and six months later the eval is dominated by examples chosen by a cohort that does not include the customers most likely to churn.

The failure rarely shows up as a regression. It shows up as a divergence: weekly eval trends up, the enterprise tier's NPS slides, and the team only diagnoses the gap when a churned account names the specific kind of question their team kept getting wrong. The eval set has no examples shaped like it. The signal you were optimizing was real. It was just measuring the wrong distribution.

The Bug Report Against a Model Version You No Longer Serve

· 11 min read
Tian Pan
Software Engineer

A customer support ticket arrives on a Tuesday. The customer attached a screenshot of an output your product generated six weeks ago. They say it is wrong, or unsafe, or simply not what they expected, and they want it fixed. Your support engineer pastes the prompt back into the same API endpoint and gets a clean, reasonable answer. The bug, as far as the system can tell, does not exist.

The bug exists. The model that produced the screenshot does not. Since the customer filed the ticket, the weights behind your v1-chat endpoint have been swapped twice — once for a quality bump, once for a cost optimization — and the original checkpoint is no longer reachable. The customer's "this is broken" is now an unfalsifiable claim against a moving target, and the support team has no path to either confirm it or close it out.

This is not a quirky edge case. It is the predictable consequence of treating model versioning as an internal MLOps concern when it is actually a customer-visible product contract. The endpoint URL is stable. The artifact behind it is not. Until your support workflow, your retention policy, and your customer contract acknowledge that gap, every bug report against a rotated checkpoint will land in the same triage void.

The Compliance Audit That Asked Which Model Produced Which Output

· 10 min read
Tian Pan
Software Engineer

The auditor's question sounds simple. She has your appeals log open, points at a row from eight months ago, and asks which model decided that case. Your engineer pulls up the schema: there is a model column, and every decision in the audit window says v1. Then someone from the platform team mentions, almost in passing, that the alias behind v1 rotated four times during the audit period: a base model upgrade, a fine-tune refresh, a vendor-side capacity move, and one rollback that lasted six hours during an incident. The honest answer is that you cannot say which checkpoint produced that decision. The auditor writes something down. "We cannot say" is not a regulator-acceptable answer, and you have just learned that the system you shipped has been failing an audit requirement it was never designed to meet.

The gap here is not a missing log line. The gap is between two different ideas of what "model" means. To the engineers shipping the system, v1 is an endpoint — a stable contract callers can point at while the thing behind it gets upgraded for free. To the auditor, "the model that produced this decision" is a specific artifact: a weight checkpoint, a hash, a thing you could in principle re-run on the same input and get a defensibly similar output. Endpoint aliases were invented to hide checkpoint rotation from callers. Audit-grade provenance demands the opposite — that every decision be attributable to exactly the checkpoint that produced it. The two ideas were on a collision course from the start; the audit just happened to be where they met.
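A sketch of what attributing each decision to a checkpoint could look like, assuming the serving layer can resolve an alias to a content hash of the weights at request time. The registry, the record fields, and the use of SHA-256 over raw weight bytes are assumptions made for illustration, not a description of any particular platform.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


def checkpoint_digest(weights: bytes) -> str:
    """Content hash that identifies one specific checkpoint."""
    return hashlib.sha256(weights).hexdigest()


class AliasRegistry:
    """Hypothetical alias table: the alias is stable, the digest behind it rotates."""

    def __init__(self) -> None:
        self._current: dict[str, str] = {}

    def point(self, alias: str, weights: bytes) -> None:
        self._current[alias] = checkpoint_digest(weights)

    def resolve(self, alias: str) -> str:
        return self._current[alias]


@dataclass(frozen=True)
class DecisionRecord:
    case_id: str
    alias: str              # what the caller asked for, e.g. "v1"
    checkpoint_sha256: str  # what actually answered
    decided_at: str
    output: str


def record_decision(registry: AliasRegistry, case_id: str, alias: str, output: str) -> DecisionRecord:
    """Resolve the alias at decision time and store the digest alongside the decision."""
    return DecisionRecord(
        case_id=case_id,
        alias=alias,
        checkpoint_sha256=registry.resolve(alias),
        decided_at=datetime.now(timezone.utc).isoformat(),
        output=output,
    )
```

With records shaped like this, two rows that both say v1 can still be told apart after a rotation, which is the question the auditor was asking.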

The Embedding Model Rotation That Shadowed Your A/B Test for a Quarter

· 10 min read
Tian Pan
Software Engineer

You ran the experiment cleanly. Two arms, one feature flag, a clear metric, the stats team blessed the design. Twelve weeks later you ship the winner, and the lift quietly evaporates within a sprint. The post-mortem turns up nothing in the code, nothing in the flag rollout, nothing on the analytics side. The thing that moved was something nobody on your experimentation list owned: the hosted embedding model behind your retrieval call returned a slightly different vector for the same query in week three, in week seven, and again on the morning your readout meeting happened. Your A/B test was real. The substrate it ran on was not.

This is the failure mode every team running retrieval-augmented generation eventually walks into and the one almost nobody designs against. The embedding endpoint is treated as a stable substrate the way Postgres is treated as a stable substrate. It is not. It is a model with a release cadence the vendor controls, a changelog you do not read, and a behavior surface that can shift without changing the dimension count, the SLA, or the API contract you signed against. The experiment you thought was measuring a feature change was measuring a retrieval regime change with the feature flag noise on top.
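One hedged way to design against it is a set of fixed probe queries whose vectors are stored once and compared against fresh embeddings on a schedule. The sketch below assumes you have already fetched the vectors from the endpoint; the 0.999 similarity floor and the function names are illustrative assumptions, and a suitable threshold would depend on the model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("dimension mismatch")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-length vector")
    return dot / (norm_a * norm_b)


def drifted_probes(
    reference: dict[str, list[float]],
    current: dict[str, list[float]],
    min_similarity: float = 0.999,
) -> list[str]:
    """Probe queries whose fresh embedding no longer matches the stored reference."""
    return sorted(
        query
        for query, vector in reference.items()
        if cosine(vector, current[query]) < min_similarity
    )
```

The dimension count can stay identical while the probes drift, so a check like this catches what a schema or contract check would not.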

The Fine-Tune Dataset You Accidentally Built While Debugging

· 9 min read
Tian Pan
Software Engineer

The thumbs-down button on your staging UI was supposed to do one thing: tell the on-call engineer which response looked bad so they could go investigate. Six months later, somebody on the modeling team pulled "all production feedback with corrections attached" into a Parquet file and ran an SFT job against it. The eval set improved on three metrics and regressed quietly on five. Nobody could explain why until somebody scrolled through the labels and found a row that read, in the corrections column, "this is fine but I hate how it phrases it." The model learned that opinion. Then it learned forty thousand more of them.

This is the failure mode where the debugging surface and the curation surface are the same surface. Engineers click "bad" because something is broken, because something looks weird, because they were about to file a ticket, because the formatting offends them, because they were checking whether the button works. The signal that flows out of that click is a mix of "this output is wrong," "this output is right but ugly," "I don't like this," and "I was bored." Treated as a single label, it certifies nothing. Trained against, it teaches the model the union of all those moods.

The Fine-Tune That Erased the Alignment You Inherited

· 9 min read
Tian Pan
Software Engineer

You picked the base model "because it was the safer one." Six months later your team has shipped a domain-tuned checkpoint that answers customer questions about wealth products with reassuring fluency, passes the task eval at 94%, and — somewhere between epoch one and epoch four — quietly forgot how to refuse anything. Nobody noticed because your launch eval suite never measured what fine-tuning removed. The capabilities it stripped were never in your task distribution, so they were never on the dashboard.

This is the most under-reported failure mode in production LLM systems right now: post-training alignment is not a property of a model family. It is a property of one specific checkpoint, and supervised fine-tuning corrodes it by default. The team that fine-tuned has not shipped a tuned version of the model they reviewed. They have shipped a different model — one whose model card describes weights nobody is serving.

The Eval Set That Started Leaking Into Your Prompt

· 10 min read
Tian Pan
Software Engineer

The benchmark number went up for four quarters in a row. User satisfaction did not. Nobody on the team could explain the gap until someone diffed the prompt template and noticed that the few-shot examples were being pulled from the same CSV that the evaluator was reading. The eval set had quietly become the in-context examples. The number was no longer measuring generalization. It was measuring how well the model could copy the nearest neighbor of a question whose answer it had just been shown.

This is the failure mode I want to name: eval-to-prompt leakage. It is structurally identical to test-set contamination in classical machine learning, but it happens through a back channel the team built deliberately. Few-shot retrieval is a reasonable engineering move. Eval banks are a reasonable engineering artifact. The contamination emerges when the two converge on the same storage layer without anyone naming the boundary.
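Naming the boundary can start with a check as small as the sketch below, which flags few-shot candidates whose question text also appears in the eval set. It assumes both sides are available as lists of question strings and only catches exact matches after case and whitespace normalization; paraphrases would need a similarity-based check. The function names are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_leaks(few_shot_questions: list[str], eval_questions: list[str]) -> list[str]:
    """Few-shot candidates that also appear in the eval set."""
    eval_prints = {fingerprint(q) for q in eval_questions}
    return [q for q in few_shot_questions if fingerprint(q) in eval_prints]


def assert_disjoint(few_shot_questions: list[str], eval_questions: list[str]) -> None:
    """Fail loudly, for example in CI, when the two pools overlap."""
    leaks = find_leaks(few_shot_questions, eval_questions)
    if leaks:
        raise AssertionError(f"{len(leaks)} few-shot example(s) also appear in the eval set")
```

Running `assert_disjoint` wherever the few-shot pool is built would turn a silent convergence on one storage layer into a failing check.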