Cost-aware LLM routing makes the cheap model the actual product surface for most users. If your eval discipline still points at the flagship, you are flying blind on 70% of traffic — here is the router-as-product framing that fixes it.
Agent harnesses that propagate temperature down the call tree turn the planner's creativity knob into the verifier's bug. The fix is per-role sampling profiles, default-deny inheritance, and a disagreement-rate eval that catches the leak.
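A minimal sketch of per-role profiles with default-deny inheritance, assuming a hypothetical harness that resolves sampling settings by role name. The `SamplingProfile` type, the `ROLE_PROFILES` table, and the role names and values are all illustrative and do not come from any particular framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplingProfile:
    temperature: float
    top_p: float = 1.0


# Hypothetical role table; the roles and numbers are placeholders.
ROLE_PROFILES = {
    "planner": SamplingProfile(temperature=0.9),
    "verifier": SamplingProfile(temperature=0.0),
}


def resolve_profile(role: str, parent: Optional[SamplingProfile] = None) -> SamplingProfile:
    """Return the profile registered for `role`.

    `parent` is accepted only to make the policy visible at the call site:
    it is never used as a fallback, so a child call cannot inherit the
    caller's temperature. An unregistered role is an error, not a default.
    """
    try:
        return ROLE_PROFILES[role]
    except KeyError:
        raise LookupError(f"no sampling profile registered for role {role!r}") from None
```

With this shape, a verifier spawned by the planner still resolves to its own profile, and a new role fails loudly until someone decides what its sampling settings should be.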
Frameworks ship session IDs; users live in tasks. The gap between them is where half of agent UX disappears, and the fix is a task ID, not longer sessions.
Production-trace eval pipelines accumulate PII that users never agreed to have processed this way. The fix is sanitization at the write boundary, schema-typed spans, and tag-based retention, not regex scrubbers at read time.
MCP made it trivially cheap to wire a developer laptop into prod-adjacent systems. The artifact is a loopback socket using credentials the engineer already has, so it is invisible to procurement, CASB, and SSO logs. This is the discovery and governance discipline that has to land before the first breach disclosure.
Centralizing a safety preamble looks like a clean DRY win until the first edit ships and thirty consumer teams' evals tank. Here's why shared prompts behave like distributed systems, and the governance scaffolding that survives the first flag day.
Speculative decoding promises identical model output at 3-6x speedup, but that guarantee binds tokens leaving the inference engine — not bytes already shown to the user. When you stream draft tokens before verification, rejected suffixes have to be retracted, and which surfaces tolerate retraction is a product decision the inference team rarely scopes.
DAU, conversion, and retention were built for click streams. AI features emit task arcs — request, response, follow-up, resolution — and the dashboard you imported from the deterministic playbook will tell you the feature is winning while users route around it.
Vendor stop_reason values give you four buckets when production triage needs eight. Here is how to build the parallel stop-taxonomy that turns a black-box termination into a debuggable signal.
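One way such a parallel taxonomy could look, as a sketch only. The eight `StopCause` buckets, the `CallContext` fields, and the four vendor strings in `_VENDOR_MAP` are assumptions chosen for illustration; substitute whatever values your provider actually returns.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StopCause(Enum):
    COMPLETED = "completed"
    STOP_SEQUENCE = "stop_sequence"
    TOOL_CALL = "tool_call"
    OUTPUT_BUDGET = "output_budget"                # token cap hit, output still usable
    TRUNCATED_STRUCTURED = "truncated_structured"  # token cap hit mid-JSON
    CLIENT_TIMEOUT = "client_timeout"
    CLIENT_CANCELLED = "client_cancelled"
    UNKNOWN = "unknown"


@dataclass
class CallContext:
    vendor_stop_reason: Optional[str]  # raw value from the provider, if any
    text: str = ""
    expects_json: bool = False
    timed_out: bool = False            # set by the caller's own deadline logic
    cancelled: bool = False            # set when the user or harness aborted


# Placeholder vendor strings; real providers use their own vocabulary.
_VENDOR_MAP = {
    "end_turn": StopCause.COMPLETED,
    "stop_sequence": StopCause.STOP_SEQUENCE,
    "tool_use": StopCause.TOOL_CALL,
}


def _is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def classify(ctx: CallContext) -> StopCause:
    # Client-side causes win: the vendor never saw these terminations.
    if ctx.cancelled:
        return StopCause.CLIENT_CANCELLED
    if ctx.timed_out:
        return StopCause.CLIENT_TIMEOUT
    if ctx.vendor_stop_reason == "max_tokens":
        if ctx.expects_json and not _is_json(ctx.text):
            return StopCause.TRUNCATED_STRUCTURED
        return StopCause.OUTPUT_BUDGET
    return _VENDOR_MAP.get(ctx.vendor_stop_reason, StopCause.UNKNOWN)
```

The point of the design is that the vendor value is one input among several: local facts such as deadlines, cancellation, and the expected output shape split a single vendor bucket into causes that need different triage.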
JSON.parse is all-or-nothing, but LLM token streams are not. Why streaming structured output is one design problem the API and the SDK have to solve together — and what a real partial parser must do.
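To make the all-or-nothing problem concrete, here is a deliberately small sketch of a partial parser in Python. It repairs a truncated prefix by closing any open string and any open containers, then hands the result to the standard `json.loads`. The function name `parse_partial` is hypothetical, and the sketch does not handle everything a real partial parser must: dangling keys, half-written literals such as `tru`, and numbers that may still grow are simply reported as unparseable.

```python
import json


def parse_partial(prefix: str):
    """Best-effort parse of a truncated JSON document.

    Returns the parsed value, or None when closing open strings and
    containers is not enough to make the prefix valid JSON.
    """
    closers = []        # closing brackets owed for containers opened so far
    in_string = False
    escaped = False
    for ch in prefix:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()

    candidate = prefix
    if in_string:
        if escaped:
            candidate = candidate[:-1]  # drop a dangling backslash
        candidate += '"'
    candidate += "".join(reversed(closers))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```

A streaming consumer would call this on the accumulated buffer after each chunk and keep the last non-None snapshot, so the UI never regresses when a chunk ends in an unrepairable position.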
Most agent frameworks run parallel tool calls as detached goroutines, then rediscover the failure modes structured concurrency solved two decades ago: partial failure, unhonored cancellation, runaway cost.
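As an illustration of the structured alternative, the sketch below runs a batch of tool calls as one unit using Python's `asyncio`. The helper name `run_tools` and its `timeout` parameter are hypothetical. The property it demonstrates is that no child task outlives the call: on the first failure, on a missed deadline, or when the caller itself is cancelled, the siblings are cancelled and awaited before control returns.

```python
import asyncio


async def run_tools(calls, timeout):
    """Run tool coroutines as a single unit with a shared deadline.

    Returns results in call order. Raises the first tool exception, or
    asyncio.TimeoutError if the batch misses its deadline. In every case
    all child tasks are finished or cancelled before this returns.
    """
    tasks = [asyncio.create_task(c) for c in calls]
    if not tasks:
        return []
    try:
        done, pending = await asyncio.wait(
            tasks, timeout=timeout, return_when=asyncio.FIRST_EXCEPTION
        )
        for t in done:
            if not t.cancelled() and t.exception() is not None:
                raise t.exception()  # partial failure fails the whole batch
        if pending:
            raise asyncio.TimeoutError("tool batch exceeded its deadline")
        return [t.result() for t in tasks]
    finally:
        # Runs on success, failure, timeout, and outer cancellation alike.
        for t in tasks:
            t.cancel()  # no-op for tasks that already finished
        await asyncio.gather(*tasks, return_exceptions=True)
```

On Python 3.11 and later, `asyncio.TaskGroup` provides the same no-orphans guarantee natively; the manual version is shown only to make the cancel-then-await step explicit.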
Single-turn evals miss the multi-turn failure modes that matter. LLM-driven user simulators with personas, patience budgets, and abandonment thresholds run thousands of conversations a night — but only when the simulator-vs-production gap is calibrated, not assumed.