Six weeks of prompt tweaks and the eval score still oscillates in a four-point band? You're at a local maximum. Here's how to diagnose which ceiling you've actually hit — prompt, retrieval, model, spec, or data — and pick the lever that breaks it.
Long conversations quietly erode the system prompt. Why agent personas drift after 8–12 turns, how to measure the half-life, and the reinforcement patterns that hold.
Application logs say the request went to eu-west-1. The provider routed it through the US and Singapore during a failover. Build the per-request sovereignty path that turns an audit into a query.
Most retrieval failures are query-shape failures, not embedding failures. A walkthrough of HyDE, decomposition, multi-query fan-out, and rank fusion — and how to diagnose which one your RAG pipeline actually needs before you reach for an encoder swap.
Retiring an AI feature is a migration, not a toggle. A field guide to grandfathering co-authored artifacts, separating writers from readers, and writing exit criteria at launch.
Span names and attribute keys are an undocumented API with consumers across cost dashboards, eval pipelines, and SLO alerts. Treat telemetry like a versioned schema or keep getting paged for breaks that fail silently.
A frozen golden set scores green while production satisfaction quietly collapses — because the users moved, not the model. Detection patterns and an eval-set shelf-life discipline.
System prompts and YAML manifests stop scaling once tool catalogs grow. A dedicated policy engine — OPA with Rego, or AWS Cedar — sitting in front of every tool call gives agents the auditable Policy Decision Point that prompt engineering can't provide.
Why uniform sampling defaults break for agent workloads, and the four decisions — start, end, stratification, retention — that determine your trace bill and your incident MTTR.
Copilot accept rates measure the frictionless half of the loop. The ROI lives in the validation tax that follows — here are the metrics that actually tell the truth.
Prompt config repos behave like feature-flag services at runtime but lack exposure tracking, audit logs, rollback telemetry, and per-user gating. Here is the governance gap and how to close it.
A prompt refactor that lifts accuracy three points can quietly destroy the hedging signal users trust. Measure calibration with ECE, reliability diagrams, and refusal rates — not just accuracy.
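As a rough illustration of the ECE measurement mentioned above, here is a minimal sketch in plain Python. The function name `expected_calibration_error`, the equal-width binning, and the default of 10 bins are assumptions chosen for illustration; they are one common way to compute ECE, not necessarily the setup the post uses.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: the weighted mean of |accuracy - confidence| per bin.

    `confidences` are the stated probabilities in [0, 1]; `correct` marks
    whether each corresponding answer was right. Both names are hypothetical.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    conf_sum = [0.0] * n_bins  # sum of confidences per bin
    hits = [0] * n_bins        # number of correct answers per bin
    counts = [0] * n_bins      # number of samples per bin

    for c, ok in zip(confidences, correct):
        # Map confidence to a bin index; a confidence of 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hits[idx] += 1 if ok else 0
        counts[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if counts[b] == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum[b] / counts[b]
        accuracy = hits[b] / counts[b]
        ece += (counts[b] / n) * abs(accuracy - avg_conf)
    return ece
```

With four answers all stated at 0.9 confidence and three of them correct, the gap between stated confidence (0.9) and observed accuracy (0.75) gives an ECE of roughly 0.15. Comparing this number before and after a prompt refactor, alongside accuracy, is one way to notice a calibration regression that an accuracy-only eval would miss.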