5 posts tagged with "reproducibility"

The Fine-Tune Artifact Your Departing Engineer Took With Them

· 12 min read
Tian Pan
Software Engineer

A fine-tune is not a file. It is the closure of a pipeline over a training set, and the team that ships the file without the closure has built a production dependency whose source code is in someone else's head. The day that person leaves with two weeks of notice and a clean handoff document is the day your bus factor on a revenue feature drops to zero and nobody notices, because the weights are still in the registry and the registry tag is still stable and the model still serves traffic. The reckoning shows up later, in a routine base-model migration that should have taken a sprint and takes a quarter instead.

The pattern is consistent across teams I have watched run into it. An ML engineer spends six months iterating on a fine-tune: data curation, hyperparameter sweeps, behavioral patches evaluated by feel against a held-out set. The final adapter weights get pushed to the model registry with a tag. The training pipeline that produced those weights is a notebook on the engineer's laptop, with hard-coded paths and floating dependencies that resolved to whatever version was latest on the day each cell was last executed. The team accepts the handoff at face value because the weights work, the eval scores are good, and the registry tag is stable. Eighteen months later, the engineer departs. Six months after that, a base-model migration requires regenerating the adapter against an updated base. The notebook runs, but the weights it produces score three points lower and regress visibly on the hardest customer segment, and the team spends four months trying and failing to reproduce the original artifact.
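One way to ship the closure alongside the weights is to write a small manifest at training time and push it to the registry under the same tag. The sketch below is only an illustration of that idea, not the pipeline described in the post: the function names `sha256_of_file` and `build_manifest` and every manifest field are hypothetical, and a real pipeline would likely also record the code commit and the eval set.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large training set need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def installed_packages():
    """Snapshot the versions that floating dependencies resolved to today."""
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:
            packages[name] = dist.version
    return dict(sorted(packages.items()))


def build_manifest(dataset_paths, hyperparameters, base_model, seed):
    """Collect what a later engineer would need to regenerate the adapter.

    All field names here are assumptions made for illustration.
    """
    return {
        "base_model": base_model,
        "seed": seed,
        "hyperparameters": hyperparameters,
        "datasets": {p: sha256_of_file(p) for p in sorted(dataset_paths)},
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": installed_packages(),
    }


def write_manifest(manifest, path):
    """Store the manifest as stable, diff-friendly JSON next to the weights."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
```

A manifest like this does not make the run reproducible on its own, but it turns "whatever was latest that day" into a recorded fact that a migration can be diffed against.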

The Silent Personalization Layer Your Customers Could Not Reproduce

· 11 min read
Tian Pan
Software Engineer

A platform team ships a quality improvement. An inference-time layer reads the user's recent interactions and silently nudges the response style: more formal here, more terse there, more technical when the history suggests an engineer is asking. The A/B test shows an aggregate satisfaction lift of a couple of points. The launch post goes out under the heading "smarter responses, no API changes required." Nobody flips a flag in the API. Nobody updates the docs. Nothing in the response payload indicates which persona the model just adopted.

Six weeks later an enterprise customer files a support ticket that says, "your model is worse than you advertised." Their internal eval suite — running the same prompts your team published benchmarks against — scores eight points lower. Your team's first move is to verify prompt parity. Prompts match exactly. Decoding parameters match. The model version string matches. The divergence traces to the personalization layer, which infers a "thin-history default persona" for the customer's freshly-provisioned test account and a richer one for the long-lived user accounts your benchmarks were measured against. The conversation about whether the personalization is a feature or a bug stops being a product decision and becomes a contract negotiation.

The Deterministic Seed Your Eval Suite Set That Your Provider Quietly Ignored

· 11 min read
Tian Pan
Software Engineer

You set seed=42. You set temperature=0. You logged the run, posted the dashboard, signed off on the model swap. The next morning the rerun returned a different number on the same prompts, and the explanation you reached for — "must be sampling noise" — was wrong twice over: there was no sampling, and the noise was structural. The seed left your client, the gateway threw it away, the kernel batched your request next to seventeen unrelated ones, and the floating-point reduction order changed under you. Your "reproducible" benchmark was always within one batch of being a different benchmark.

This failure mode is quiet because every layer in the stack is technically correct. The SDK accepts the seed. The provider documents the seed. The model returns a system_fingerprint. The eval harness logs all three. Nothing 5xx's, nothing warns, nothing protests. The number on the dashboard just shifts, and the team rationalizes the shift as the kind of jitter that always existed — because they have no instrument that can tell them whether they're looking at stochastic decoding or at a backend rotation that invalidated three weeks of comparisons.
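The missing instrument can start as something very small: a check that compares what the harness already logs before anyone interprets a score change. The following is a minimal sketch under stated assumptions. The `RunRecord` schema, the `classify_shift` function, and its labels are hypothetical; the only field taken from the text is `system_fingerprint`, and a matching fingerprint is treated here as weak evidence only, since batching can still change the floating-point reduction order on the same backend.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    """What an eval harness logged for one run (hypothetical schema)."""
    requested_seed: int
    temperature: float
    system_fingerprint: Optional[str]  # as returned by the provider, if any
    score: float


def classify_shift(baseline: RunRecord, rerun: RunRecord,
                   tolerance: float = 0.0) -> str:
    """Label a score difference between two runs over the same prompts."""
    if abs(baseline.score - rerun.score) <= tolerance:
        return "stable"
    # The client sent different settings, so the runs were never comparable.
    if (baseline.requested_seed, baseline.temperature) != (
        rerun.requested_seed, rerun.temperature
    ):
        return "client-config-changed"
    # Without a fingerprint on both sides the shift cannot be attributed.
    if baseline.system_fingerprint is None or rerun.system_fingerprint is None:
        return "unattributable"
    if baseline.system_fingerprint != rerun.system_fingerprint:
        return "backend-changed"
    # Same settings, same reported backend, different number.
    return "nondeterminism-on-same-backend"
```

For example, `classify_shift(RunRecord(42, 0.0, "fp_a", 0.81), RunRecord(42, 0.0, "fp_b", 0.78))` returns `"backend-changed"`, which is a reason to rerun the baseline rather than to call the difference jitter.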

The Incident Ticket With No Repro Steps: Reproducibility as Something You Engineer

· 10 min read
Tian Pan
Software Engineer

The incident ticket is specific in the way only real incidents are. At 02:14 the support agent closed a customer account that should have been put on a 30-day grace period. The customer noticed. The ticket lands on your desk with a single line under "Steps to reproduce": unknown.

You open the trace. You can see the agent called close_account instead of set_grace_period. You can see the tool succeeded. What you cannot see is why the model chose that branch — and when you replay the same customer message through the same agent, it does the right thing. Twice. The postmortem now has a paragraph-shaped hole where the root cause should be, and the only honest thing you can write is "could not reproduce."

Deterministic Replay: How to Debug AI Agents That Never Run the Same Way Twice

· 11 min read
Tian Pan
Software Engineer

Your agent failed in production last Tuesday. A customer reported a wrong answer. You pull up the logs, see the final output, maybe a few intermediate print statements — and then you're stuck. You can't re-run the agent and get the same failure because the model won't produce the same tokens, the API your tool called now returns different data, and the timestamp embedded in the prompt has moved forward. The bug is gone, and you're left staring at circumstantial evidence.

This is the fundamental debugging problem for AI agents. Traditional software is largely deterministic, so you can usually reproduce a bug by recreating its inputs. Agent systems are not: every run is a unique combination of model sampling, live API responses, and time-dependent state. Without specialized tooling, post-mortem debugging becomes forensic guesswork.

Deterministic replay solves this by recording every source of non-determinism during execution and substituting those recordings during replay — turning your unreproducible agent run into something you can step through like a debugger.
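The record-and-substitute idea can be sketched in a few lines. This is a toy illustration, not a production replay system: `ReplayLog`, `run_agent`, and the event format are hypothetical names chosen for this example, and the sketch assumes every recorded result is JSON-serializable and that calls happen in the same order on replay.

```python
import json
import time


class ReplayLog:
    """Records results of non-deterministic calls, or replays them in order."""

    def __init__(self, events=None):
        # Passing recorded events switches the log into replay mode.
        self.replaying = events is not None
        self.events = list(events) if events is not None else []
        self._cursor = 0

    def call(self, name, fn, *args, **kwargs):
        if self.replaying:
            if self._cursor >= len(self.events):
                raise RuntimeError(f"replay exhausted before call to {name!r}")
            event = self.events[self._cursor]
            self._cursor += 1
            if event["name"] != name:
                raise RuntimeError(
                    f"replay diverged: expected {event['name']!r}, got {name!r}"
                )
            return event["result"]  # fn is never invoked during replay
        result = fn(*args, **kwargs)
        self.events.append({"name": name, "result": result})
        return result

    def dumps(self):
        """Serialize the recording so it can be stored with the trace."""
        return json.dumps(self.events)


def run_agent(log, user_message, model, tool, clock=time.time):
    """Toy agent loop: every non-deterministic input goes through log.call."""
    now = log.call("clock", clock)
    plan = log.call("model", model, f"[{now}] {user_message}")
    data = log.call("tool", tool, plan)
    return log.call("model", model, f"{plan} -> {data}")
```

Recording a run with `ReplayLog()` and later rerunning it with `ReplayLog(json.loads(saved))` yields the same model outputs, tool data, and timestamp, so the agent code around them can be stepped through without touching the live model or API.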