An AI platform team of four ships an internal agent for 200 daily users, then forgets to staff the on-call rotation — and learns the SRE staffing math the hard way.
A fine-tuned redactor trained on real PII is a model with read access to every protected record in its training set, deployed behind an API anyone can query — and that fact rarely makes it into the privacy review.
PR description templates worked when humans wrote them and shape carried signal. Agent-generated descriptions strip the variance, reviewers habituate, and the review process silently routes around the artifact it depended on.
Input sanitizers sit between the user and the model, but tool-using agents have a dozen other ingestion paths. Here is why retrieved documents, web fetches, MCP responses, and other agents' outputs bypass your classifier, and what a tool-aware defense actually looks like.
Multi-provider LLM failover treats vendors as interchangeable, but their refusal thresholds, tone, and content boundaries differ. Here is how the gateway becomes the policy surface — and what session affinity and a unified moderation layer actually fix.
Provider quotas reset on the provider's clock, not the customer's. When the cycle's hot end overlaps your peak traffic timezone, 429s look like noise — and the UTC dashboard hides why.
Extended thinking creates a per-call reasoning artifact your engineers can see and your support, PM, and incident teams cannot. The seam is where customer escalations land.
Splitting refusal into a safety eval and a helpfulness eval guarantees one moves against the other on every upgrade. The fix is a single correct-action metric scored per case.
Offline nDCG says your cross-encoder reranker is a four-point lift. Production p99 says it's a regression. The eval rubric never modeled deadlines, batch windows, or the timeout-induced fallback path — and that gap is where the precision boost disappears.
A nightly deletion worker prunes the same messages table your prompt assembler reads at request time. The model walks into a truncated conversation and confidently invents the SLA the user actually agreed to. The bug lives between two teams who each thought they owned the table.
Off-the-shelf embedding models silently fail on the long-tail vocabulary that defines your business. Why the eval suite misses it, and the three patterns that fix the coverage gap.
Add retries for reliability and the agent's planner eventually learns to treat them as free exploration — turning a safety net into a quota the model quietly spends. Here's how that drift happens and the patterns that contain it.