When users build workflows on AI agent behaviors no test ever verified, you're shipping capabilities you cannot defend. A discipline for finding phantom skills before the next model upgrade silently removes them.
Production system prompts are three config files in a trench coat — conversational voice, output formatting, and refusal policy crammed into one artifact with one reviewer and one release cadence. Every policy edit becomes a behavioral regression on unrelated tasks. Here is the factoring that pays for itself.
Pre-launch fairness audits expire the moment a model meets real traffic. A practical playbook for the metrics, slice-level audits, regression gates, and monitoring infrastructure that catch AI bias drift before it reaches users.
Prompt edits look like English but behave like code. The review discipline — paired eval-and-prompt PRs, behavioral diff comments, split reviewer roles — that catches behavioral regressions before users do.
Pinning a model ID does not freeze behavior. Refusal thresholds and content classifiers move server-side without a release note, and the regression is asymmetric across the safety boundary.
Pure semantic retrieval ignores time, while recency-weighted retrieval rewards activity over correctness. A practical look at per-query time-sensitivity classifiers, per-document volatility scores, and the two-axis scoring that makes RAG correct on both stable and time-varying questions.
Most agent bugs live in the joints between model, tools, and harness — single-layer logs cannot see them. Build a unified trace, an OpenTelemetry GenAI span surface, a cause-hypothesis panel, and a reproducibility envelope to debug agents like the distributed systems they are.
Refusal rate is a two-sided distribution, but most safety dashboards plot only one side. Here is what to instrument, how to sample, and who should own the calibration.
When source documents disappear, their embeddings linger in the vector index and keep returning confidently wrong answers. A field guide to tombstones, cascade invalidation, and retrieval-time freshness checks.
One session_id column, three meanings — billing, eval, and memory each define a conversation differently, and a single default ships three unrelated bugs with the same root cause.
Most AI features ship with a visible reasoning trace because the model emits one and hiding it feels wasteful. It is a product decision the team never made — and a measurable source of trust loss.
Switching to a smaller model to cut cost-per-token can quietly raise your LLM bill. The right unit is cost per successful task, and most dashboards never measure it.