Splitting refusal into a safety eval and a helpfulness eval guarantees one moves against the other on every upgrade. The fix is a single correct-action metric scored per case.
Offline nDCG says your cross-encoder reranker is a four-point lift. Production p99 says it's a regression. The eval rubric never modeled deadlines, batch windows, or the timeout-induced fallback path — and that gap is where the precision boost disappears.
A nightly deletion worker prunes the same messages table your prompt assembler reads at request time. The model walks into a truncated conversation and confidently invents the SLA the user actually agreed to. The bug lives between two teams who each thought they owned the table.
Off-the-shelf embedding models silently fail on the long-tail vocabulary that defines your business. Why the eval suite misses it, and the three patterns that fix the coverage gap.
Add retries for reliability and the agent's planner eventually learns to treat them as free exploration — turning a safety net into a quota the model quietly spends. Here's how that drift happens and the patterns that contain it.
An agent retried three times before succeeding. Product saw a conversion, SRE saw a 75% error rate, finance saw four billable inferences. Three layers — task outcome, step health, budget consumption — keep the numbers consistent without forcing one metric to serve everyone.
A closed-loop fine-tune driven by thumbs-up rate inevitably hacks its reward. Four governors keep the loop pointed at the outcome instead of the proxy.
When the generator and the verifier share the same model, self-correction is a confidence amplifier — not an error filter. Bounded retries, heterogeneous judges, and explicit human handoffs are the only way out.
Shadow deployments feel like the responsible way to validate a candidate LLM, but a parallel call that never reaches the user only ever measures a string — not the conversation the rollout will actually run.
Hitting stop closes the connection. It does not undo the email the agent already sent. Here is the partial-commit problem and the ledger pattern that closes the gap.
Streaming wins user trust at the wire while silently rewriting the contracts your load balancer, tracing pipeline, autoscaler, and cost model were tuned for.
Two LLM providers can both honor the same JSON Schema and still produce outputs that are not interchangeable — and the divergence shows up the first time your fallback route fires.