A one-line debugging change added a timestamp to a summary cache key and silently tripled the LLM bill for two weeks — a study in why cache keys are a contract, not plumbing, and why hit rate must be a first-class signal.
Per-feature dashboards track token spend. Per-provider dashboards track invoices. The quarterly embeddings re-index falls between both and lands in an unowned infrastructure bucket — where 40% of AI spend goes to die unreviewed.
Regional API endpoints commit to where your request goes, not where the cached prefix bytes that satisfy it live. The auditable boundary and the cache placement boundary are governed by different SLAs — and the gap is where compliance posture breaks.
A field your platform team added for incident triage ended up in a tenant audit export — the disclosure required no attacker, just two correct decisions composing into a leak.
A provider can amend a deprecation date in place with no diff and no notification. The team that filed the original date in the deferred bucket finds out the way support tickets find them.
Hitting stop closes the LLM stream cleanly. It does not stop the HTTP request the tool already opened to a third party that has no idea the conversation ended. Here is why AbortSignal stops at the socket, and what to build at the commit boundary instead.
A deprecated embedding endpoint that quietly routes to a 'compatibility' successor can halve your retrieval recall without a deploy. Here's why query/document embedding mismatch is the silent killer of RAG, and how to pin endpoints to the corpus they produced.
An eval suite that grades the wrong prompt version reports green on a broken release. The fix is not faster cache invalidation — it is content-addressed prompt hashes that make eval/prod drift impossible to express.
An equal-weight composite of helpfulness, clarity, empathy, and accuracy will quietly reward hedged wrongness over blunt correctness. Here is why the dashboard goes green while the product regresses, and the rubric patterns that put the gradient back where you wanted it.
A team-level data leak hides inside an improving eval dashboard when prompt engineers reuse curated eval examples as few-shot demonstrations. Here is why the contamination is invisible, what independence actually requires, and who has to be empowered to say no.
Failover keeps your LLM app available when the primary goes down — but the fallback model reads a system prompt that was tuned for someone else, and your users notice the difference.
Few-shot examples are not neutral demonstrations — they are case law. Models bind to the closest example by surface tokens and inherit its constraints, shipping confident-wrong answers eval suites cannot see.