Agent prompts hide if-else branches your eval suite never executes. Borrow MC/DC discipline, instrument planner decisions with branch IDs, and gate prompt diffs on coverage so silent re-routings stop slipping into production.
Salience-weighted memory eviction looks like a quality win on day one and turns into a migration project at every model upgrade — here is why LRU is the boring choice that survives.
Agent workers accumulate ephemeral filesystem state — extracted PDFs, transcribed audio, cached attachments — on disks no one classified. The fix is naming the tier, not architectural sophistication.
AI engineering work doesn't fit deterministic packet templates. How to frame ramp-shipped, eval-driven, stochastic work so calibration committees can actually credit it.
Finite provider quota plus three product teams with launch deadlines is a budget allocation system. The team running your LLM gateway is the one being asked to allocate — usually without a policy, a sponsor, or the telemetry to defend a decision.
Deflection rates measure difficulty avoided, not difficulty removed. When AI handles the easy 80 percent of support tickets, the human queue becomes 100 percent edge cases — and the team feels it long before the dashboard does.
Long-running browser agents that reuse profiles to dodge cold-start cost can serve one tenant's session to another's request. The trace says success — and a different user's dashboard just got read.
Retiring an AI agent by deleting its code leaves OAuth tokens, service accounts, vector indices, and eval datasets rotting in production. The fix starts at launch, not sunset.
Once every candidate model scores 95+ on the same test cases, your eval suite has stopped measuring anything — the ruler outgrew the platform, not the other way around.
Golden eval sets are real customer queries paired with labeled correct answers — and most teams handle them as engineering fixtures, bypassing every privacy control built for the underlying production data.
Eval sets refreshed from production traces inherit a survivor bias: the users who hit the worst failures left and stopped generating traces. Scores climb while retention slips. Here is how to break the loop.
Your fallback path was supposed to fire on 0.5% of requests. It is now serving 38%. The fix is to treat tier mix as a first-class SLO.