Semantic caches can serve another user's response in under a millisecond, and your hit-rate dashboard will turn green doing it. Here are the cache-key design, provenance envelope, and audit trail that prevent cross-user leaks by construction.
Text-level diffs have almost no correlation with how an LLM's behavior changes. A three-word edit can flip 30% of outputs while a fifty-line restructure changes nothing. Here is how to build a semantic diff toolkit that PR reviewers can actually trust.
Pinning a model version buys short-term stability and quietly accrues deprecation debt. Scheduled re-qualification, drift monitoring against the next tier, and a dual-track prompt portfolio turn migrations into routine operations instead of fire drills.
Prompt-as-spec collapses as soon as a prompt has more than one author. A spec-first contract (inputs, outputs, invariants, errors, refusals, escalations) turns prompt edits into reviewable diffs, makes evals derivable, and shrinks owner onboarding from months to a week.
Synthetic preference data feels like a free lunch, until your product quietly starts sounding exactly like the teacher model whose outputs you trained it on. A field guide to spotting, measuring, and bounding RLHF flavor drift.
Anomalous LLM token spend is the earliest signal of a compromised API key, prompt injection, or data exfiltration — but billing owns the dashboard and security owns the response. Here is how to wire them together.
Tool spec text is the prompt the model reads before deciding when to invoke a tool. Treat it like a prompt, with concrete use cases, negative examples, and sibling disambiguation, not like OpenAPI docs.
Most agent teams measure tool-call success but never measure tool hallucination. Split the rate into three categories (unknown-tool, shadow-call, and hallucinated-argument) and build the probe suite that catches each before production does.
The most dangerous bug in a production agent isn't the one that throws — it's the one where the tool description promises a field the backend renamed two sprints ago, and the model keeps reasoning as if nothing changed.
Tool outputs share a token stream with the system prompt, so every read-tool is a prompt-injection surface. Here is the trust-boundary model, the four production patterns, and the eval harness that actually measures whether your defenses hold.
Agent tool schemas live in two places at once — the runtime spec and the model's in-context memory. Renaming a parameter breaks both in different ways. Here is the deprecation playbook.
p50 and p99 total latency miss the single number that governs how your AI product feels: time to first token. Here is why reasoning models make it worse, what to measure, and how to route around it.
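As a rough illustration of the measurement the last item refers to, here is a minimal sketch that records time to first token alongside total latency for one streamed response. The function name `measure_stream` and the injectable `clock` parameter are hypothetical, not taken from the article. It assumes the response arrives as a lazy iterable of tokens, so that starting the timer just before iteration approximates the moment the request was sent.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for one streamed response.

    Assumption: `tokens` is lazy, so the request effectively starts when we
    begin iterating. `clock` is injectable so the timing can be tested.
    """
    start = clock()
    ttft: Optional[float] = None
    count = 0
    for _ in tokens:
        if ttft is None:
            # The first token has arrived: this is the number the user feels.
            ttft = clock() - start
        count += 1
    total = clock() - start
    # ttft stays None when the stream produced no tokens at all.
    return ttft, total, count
```

Tracking `ttft` as its own percentile series, rather than inferring it from total latency, is what would make a slow-to-start but fast-to-finish response visible.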