Five layers of implicit time inside every LLM prompt — and why the layers silently disagree the moment a request is replayed, batched, or evaluated against a pinned snapshot.
When a model swap preserves your structured-output schema but changes token pacing, pause patterns, and intermediate phrasing, you ship a breaking change to a contract you never wrote down.
A structured way to negotiate latency targets with product before they become commitments — the conversion table, the pick-two frame, and why TTFT is usually the number that matters.
Why a 200ms MCP tool call becomes a 4-second agent loop, where the cold-start tax actually lives, and the warm-pool discipline that turns multi-second penalties into sub-100ms ones.
Coding agents removed the code-writing constraint and dropped the load on the review queue. The team that ships agents without redesigning review will ship a backlog generator.
Single-axis rollback gives a clean false negative when an agent regression lives in the interaction between a new model and a new tool description. The fix is bisecting the cross-product of model, prompt, tool catalog, retrieval index, and sampling config — with a named version envelope as the unit of rollback.
Multimodal models silently fuse disagreeing vision and text channels into confident hybrid answers. Where the failure surfaces, why evals miss it, and how to build a disagreement primitive.
Most prompt evals score the happy path and miss the tail. Build a stress matrix that varies length, register, language, and formality — and grade degradation curves, not single accuracy numbers.
Prompt caching saves 80–90% on cached requests by sharing KV state across tenants — and turns time-to-first-token into a side channel that recovers other customers' prompts at 92% accuracy. The cost-vs-isolation trade-off most teams have not had.
Standard bounty rubrics break against AI features whose specified behavior is 'be helpful.' A working program needs a CIA-anchored severity rubric, a probabilistic reproducibility clause, an explicit tools-in-scope list, named test tenants under safe harbor, and an AI team bound to a mitigation SLA.
int4 quantization halves serving cost and barely moves median benchmarks — while quietly destroying rare-token completion, low-resource languages, and long-context reasoning. Here is why the cliff is invisible to the eval suite that approved the swap, and the rollout discipline that makes the regression visible before customers find it.
Provider model availability staggers by region. Your single-model abstraction silently bifurcates across continents, and the eval suite is the last place that will tell you.