2 posts tagged with "replay"

The Incident Ticket With No Repro Steps: Reproducibility as Something You Engineer

· 10 min read
Tian Pan
Software Engineer

The incident ticket is specific in the way only real incidents are. At 02:14 the support agent closed a customer account that should have been put on a 30-day grace period. The customer noticed. The ticket lands on your desk with a single line under "Steps to reproduce": unknown.

You open the trace. You can see the agent called close_account instead of set_grace_period. You can see the tool succeeded. What you cannot see is why the model chose that branch — and when you replay the same customer message through the same agent, it does the right thing. Twice. The postmortem now has a paragraph-shaped hole where the root cause should be, and the only honest thing you can write is "could not reproduce."

The Agent Backfill Problem: Your Model Upgrade Is a Trial of the Last 90 Days

· 12 min read
Tian Pan
Software Engineer

Here is a Tuesday-morning conversation that nobody on your AI team is prepared for. The new model lands in shadow mode. Within an hour the eval dashboard lights up: it categorizes 4% of refund requests differently than the model you have been running for the last quarter. Most of those flips look like the new model is right. Someone in the room — usually the one with the most lawyers in their reporting line — asks the question that ends the celebration: so what are we doing about the ninety days of decisions the old model already shipped?

That is the agent backfill problem. The moment a smarter model starts producing outputs that look more correct than your previous model's, every durable decision the previous model made becomes a contested record. You did not intend to indict the past. The new model did it for you, automatically, the first time you compared traces. And now you have three questions that collide: an engineering question (can we replay history?), a legal question (do we have to disclose corrected outcomes?), and a product question (do users see retroactive changes?).
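To make the trace comparison concrete, here is a minimal sketch of how the disagreement between two models might be measured. It assumes each shipped decision was logged with a request ID and a category label, and that the new model has been replayed over the same requests. The `Decision` type and the `find_flips` and `flip_rate` helpers are hypothetical names invented for this illustration, not part of any particular tracing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """One logged outcome. The field names are assumptions for this sketch."""
    request_id: str
    label: str


def find_flips(old, new):
    """Return (request_id, old_label, new_label) for every disagreement.

    Requests that the new model never saw are skipped, not counted as flips.
    """
    new_by_id = {d.request_id: d.label for d in new}
    flips = []
    for d in old:
        new_label = new_by_id.get(d.request_id)
        if new_label is not None and new_label != d.label:
            flips.append((d.request_id, d.label, new_label))
    return flips


def flip_rate(old, new):
    """Fraction of requests seen by both models that changed label."""
    new_ids = {d.request_id for d in new}
    compared = [d for d in old if d.request_id in new_ids]
    if not compared:
        return 0.0
    return len(find_flips(old, new)) / len(compared)
```

The flip rate only says how much the two models disagree. Deciding which of the flipped records should be corrected, disclosed, or left alone is the legal and product half of the problem, and no comparison script settles that.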