LLM-as-judge agreement with humans is highest in the muddy middle and collapses at the decision boundary. The discipline that keeps the unlock honest: per-slice kappa, drift dashboards, cross-family ensembles for high-stakes slices, and an explicit ceiling past which humans grade.
A patch bump on your model SDK can quietly rewrite prompt behavior, break JSON parsing, and ship regressions past your eval gate. Here is the discipline that catches it.
Traditional APMs were built for bounded dimensions and stateless services. LLM workloads have a cardinality profile closer to product analytics, and the mismatch silently strips the only signal that would surface a broken prompt.
A shared prompt library quietly accretes model-specific forks that nobody tracks, breaking the contract between your eval suite and your routing layer at every model upgrade.
Rolling back an LLM upgrade isn't a button press — it's a partial, hysteretic operation closer to a database migration. Here's the control plane your incident playbook needs before the next bad model rolls out.
Routing 60% of LLM traffic to a cheaper model bends the cost graph — and silently splits your AI feature into two products. The aggregate accuracy metric averages over the segment that gets hurt, two failure modes show up as one bug report, and customers experience two assistants with no release notes.
Your English eval suite cost $40K. The seven-locale international launch will not cost $280K — the real curve is closer to N×L^1.3 because cross-locale comparison is a meta-eval that doesn't decompose.
Shared on-call rotations break the moment one of those services is an LLM-backed feature. Here is the literacy prerequisite, dashboard hygiene, and shadow-period playbook that keeps the AI team out of bed at 2am.
Shipping one on-device model to every user means you're either burning battery on flagships or shipping a degraded product on the long tail. The discipline that fixes it looks more like a CDN than a model registry.
Tools that return unbounded lists turn agents into the SELECT * antipattern of the function-calling era. Pagination is a load-shedding primitive — make it a convention in your tool catalog, not a per-tool decision.
Vector stores ship without the migration tooling Postgres has had for two decades — no ALTER TABLE, no online schema change, no per-row version. The discipline that makes embedding upgrades survivable starts with a single column most teams forget to add.
Prompt caching's discount is real until one tenant's launch evicts everyone else's prefixes. The shared inference cache is a tenant-coupling surface, and the bill lands weeks after the incident.