Agent errors are not support escalations; they are billing events with a real owner. Here is how to design a liability model, provenance trails, reversibility tiers, and mistake-cost evals before the first angry ticket arrives.
An LLM caller has no inbox, no stable identity, and no obligation to read your migration guide. Here is why the fifteen-year-old API deprecation playbook fails for agents — and what to change in the tool schema, error messages, and gateway instead.
The engineer who holds your prompt and eval knowledge is being repriced by the market faster than your comp band moves. Here is why the generic IC ladder cannot see them, and what to change before they leave.
Long-term memory ships as a feature, but it is a cache of facts about a world that keeps changing. Without invalidation, provenance, and conflict rules, it becomes a slow-motion correctness bug.
Adding a confidence field to an LLM output feels free. It is not. Here is the per-request tax it imposes, why the number is rarely calibrated, and what to measure before routing production traffic on it.
Pruning an eval suite for speed or cost looks like maintenance, but every deleted case retires a guarantee the team can no longer see. Borrow the API deprecation lifecycle to retire eval cases deliberately.
An eval score is a lossy compression of AI quality, and the PM who owns the launch often can't decompress it. Here is the literacy bridge that keeps ship decisions anchored to the data instead of the loudest voice.
Retrying a timed-out LLM call does not re-fetch the same answer — it samples a new one. Here is why retry-on-timeout breaks against a nondeterministic backend, and how idempotency keys make it safe again.
A streaming endpoint commits to a 200 the moment the first token flushes, so every failure after that hides from your load balancer, retry middleware, and SLO dashboard. Here is how to make the body carry the verdict the header no longer can.
Streaming AI features have two latencies that diverge — time-to-first-token and time-to-completion — and most teams instrument only the one users feel least. Here is how to split the metric and the SLO.
AI capability probes quietly become roadmap commitments as 'it works once' travels through standup, roadmap, and a sales call. Here is the capability-test artifact and promotion gate that stop a demo from turning into a contract.
Step-through debugging breaks when inputs are stochastic. The replacement is trace-first and replay-based, with four affordances — timeline scrubbing, branch comparison, replay-with-perturbation, and per-step intent recovery — that look nothing like the IDE's debug toolbar.