Skip to main content

2 posts tagged with "drift"

View all tags

Argument Hallucination Is a Drift Signal, Not a Model Bug

· 10 min read
Tian Pan
Software Engineer

The ticket says "model hallucinated a user ID." The triage label is model-quality. The fix is one more sentence in the system prompt. Six weeks later a different tool starts hallucinating a date format, and the loop runs again. After a year of this, the prompt has grown into a 4,000-token apology for the entire backend, and the team is convinced the model is just unreliable on tool arguments.

The model isn't unreliable. The model is a contract-conformance machine reading the contract you gave it — and the contract you gave it has been quietly drifting away from the contract on the other side of the wire. Most production "argument hallucinations" are not model failures. They are integration tests your tool description is silently failing, surfacing as model output because that is the only place in the stack where the divergence becomes visible.

Eval Sets Have Seasons: Why Quality Drops on the First Monday of Tax Season

· 12 min read
Tian Pan
Software Engineer

The dashboard fired its first regression alert on a Monday morning in late January. Quality score on the support assistant dropped three points overnight. No prompt change shipped over the weekend. No model swap. The eval suite — a hand-curated 800-row gold set that the team had built six months earlier — was unchanged. Somebody opened an incident.

Two days of bisecting later, the answer was uninteresting and structural. It was the first business Monday after the IRS opened tax filing for the year. Half the inbound queries had shifted from "where is my paycheck deposit" to "how do I report a 1099-K from a payment app." The eval set, sampled in summer, had nothing to say about a 1099-K. The model wasn't worse. The customer was different. The gate was calibrated against a customer who no longer existed.

This pattern repeats every quarter in every product that has a seasonal user — fintech in tax season, sales tools at end-of-quarter, education at back-to-school, e-commerce in returns season, travel at booking season, healthcare at enrollment season. The eval-set-as-fixed-asset is a comfortable abstraction, and it is wrong on a calendar that nobody updates.