Why uniform sampling defaults break for agent workloads, and the four decisions — start, end, stratification, retention — that determine your trace bill and your incident MTTR.
Copilot accept rates measure the frictionless half of the loop. The ROI lives in the validation tax that follows — here are the metrics that actually tell the truth.
Prompt config repos behave like feature-flag services at runtime but lack exposure tracking, audit logs, rollback telemetry, and per-user gating. Here is the governance gap and how to close it.
A prompt refactor that lifts accuracy three points can quietly destroy the hedging signal users trust. Measure calibration with ECE, reliability diagrams, and refusal rates — not just accuracy.
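For readers unfamiliar with ECE (expected calibration error), here is a minimal sketch of the usual equal-width-bin formulation: the bin-size-weighted average of the gap between accuracy and mean confidence. The function name, the choice of 10 bins, and the input shape (one confidence score and one correctness flag per answer) are assumptions made for illustration, not something the article prescribes.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|.

    confidences: floats in [0, 1], one per prediction (assumed input shape).
    correct: booleans, True where the prediction was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to a bin; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

Running this on the same eval set before and after a prompt change gives a single number to compare alongside accuracy; a lower value means stated confidence tracks correctness more closely.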
Production agents confidently confirm actions that never happened because teams treat the chat text, rather than the tool call, as the contract. A pattern for separating narration from commitment.
When a smarter model disagrees with the one you shipped, every durable agent decision becomes a contested record. A framework for eval, decision, and action replay — plus the architectural prerequisites and policy matrix you need before the next upgrade.
Model upgrades raise your aggregate pass rate while concentrating the residual failures on the hardest 5% of traffic — here's how stratified evals and capability-frontier probing expose the cliff before it lands in your on-call rotation.
Tool-level idempotency keys are not enough when a non-deterministic planner can re-emit the same action. The contract has to live at the orchestration boundary, keyed by structural run state — not by model-authored arguments.
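A minimal sketch of what keying by structural run state could look like. Everything here is hypothetical: the `Orchestrator` class, the `(run_id, step_index, tool_name)` key shape, and the in-memory result store are assumptions chosen for illustration, and a real system would need durable storage and concurrency control.

```python
import hashlib
import json


class Orchestrator:
    """Hypothetical orchestration boundary that deduplicates tool actions."""

    def __init__(self):
        # In-memory store for illustration only; assume a durable store in practice.
        self._results = {}

    @staticmethod
    def action_key(run_id, step_index, tool_name):
        # The key uses only structural run state. Model-authored arguments are
        # deliberately excluded, because a non-deterministic planner may
        # rephrase them when it re-emits the same action.
        raw = json.dumps([run_id, step_index, tool_name]).encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def execute(self, run_id, step_index, tool_name, tool_fn, args):
        key = self.action_key(run_id, step_index, tool_name)
        if key in self._results:
            # Same structural step seen before: return the recorded result
            # without calling the tool again.
            return self._results[key]
        result = tool_fn(**args)
        self._results[key] = result
        return result
```

With this shape, a planner that re-emits step 3 of a run with slightly different wording in its arguments gets the recorded result back instead of triggering a second side effect, while a genuinely new step index executes normally.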
Agent latency is a nested tree of planning calls, tool fan-outs, and sub-agents — flame graphs sorted by duration hide the critical path, so local optimizations miss the real budget violation. Here's how to budget, propagate deadlines, and observe slack the tree way.
Agent memory has two schemas — the store and the model's context — and only one of them migrates with a SQL script. Why protobuf's additive-only discipline is the right starting point, and what the shadow-write playbook needs on top.
Agents fail by continuing to talk. Confident prose papers over tool errors while writes never commit. The fix: demote the model's claim to a hypothesis, promote tool responses and post-action probes to authoritative signals, and measure effect landing instead of turn success.
Granting an agent PagerDuty access is an infra decision with product-team consequences. A control plane for human-facing tools — rate limits, dry-run, off-ramps — that prompts can't enforce.