Agent loops turn a 2% tool error rate into a 20% user-visible failure rate by multiplying retries across steps and SDK layers. Here is the math, the self-DoS pattern, and the retry budget discipline that stops it.
Filling an LLM's advertised context window wrecks accuracy at the right edge — the failure mode past 'lost in the middle,' with benchmarks, safety margins by task, and prompt fixes.
When most diffs in a repo start life as model output, reviewers anchor on 'looks plausible' and miss the semantic bugs that don't render as syntactic smell. The countermeasures, the disclosure question leadership has to answer, and the incident curve that catches up six months later.
Head-based and uniform-random sampling silently excise the rare catastrophic agent trajectories from your debug corpus. Tail sampling, anomaly-keyed retention, and per-failure-mode reservoirs build a debug dataset that actually contains the failures you need.
Semantic caches can serve another user's response in under a millisecond and your hit-rate dashboard will turn green doing it. The cache-key design, provenance envelope, and audit trail that prevent cross-user leak by construction.
Text-level diffs have almost no correlation with how an LLM's behavior changes. A three-word edit can flip 30% of outputs while a fifty-line restructure changes nothing. Here is how to build a semantic diff toolkit that PR reviewers can actually trust.
Pinning a model version buys short-term stability and quietly accrues deprecation debt. Scheduled re-qualification, drift monitoring against the next tier, and a dual-track prompt portfolio turn migrations into routine operations instead of fire drills.
Prompt-as-spec collapses under more than one author. A spec-first contract — inputs, outputs, invariants, errors, refusals, escalations — turns prompt edits into diffs, makes evals derivable, and shrinks owner onboarding from months to a week.
Synthetic preference data feels like a free lunch — until your product quietly starts sounding exactly like the teacher model you trained it from. A field guide to spotting, measuring, and bounding RLHF flavor drift.
Anomalous LLM token spend is the earliest signal of a compromised API key, prompt injection, or data exfiltration — but billing owns the dashboard and security owns the response. Here is how to wire them together.
Tool spec text is the prompt the model reads before deciding when to invoke. Treat it like a prompt — concrete use cases, negative examples, sibling disambiguation — not like OpenAPI docs.
Most agent teams measure tool-call success but never measure tool hallucination. Split the rate into three — unknown-tool, shadow-call, hallucinated-argument — and build the probe suite that catches each before production does.