Coding agents removed the code-writing constraint and dropped the load on the review queue. The team that ships agents without redesigning review will ship a backlog generator.
Single-axis rollback gives a clean false negative when an agent regression lives in the interaction between a new model and a new tool description. The fix is bisecting the cross-product of model, prompt, tool catalog, retrieval index, and sampling config — with a named version envelope as the unit of rollback.
Multimodal models silently fuse disagreeing vision and text channels into confident hybrid answers. Where the failure surfaces, why evals miss it, and how to build a disagreement primitive.
Most prompt evals score the happy path and miss the tail. Build a stress matrix that varies length, register, language, and formality — and grade degradation curves, not single accuracy numbers.
Prompt caching saves 80–90% on cached requests by sharing KV state across tenants — and turns time-to-first-token into a side channel that recovers other customers' prompts at 92% accuracy. The cost-vs-isolation trade-off most teams have not had.
Standard bounty rubrics break against AI features whose specified behavior is 'be helpful.' A working program needs a CIA-anchored severity rubric, a probabilistic reproducibility clause, an explicit tools-in-scope list, named test tenants under safe harbor, and an AI team bound to a mitigation SLA.
int4 quantization halves serving cost and barely moves median benchmarks — while quietly destroying rare-token completion, low-resource languages, and long-context reasoning. Here is why the cliff is invisible to the eval suite that approved the swap, and the rollout discipline that makes the regression visible before customers find it.
Provider model availability staggers by region. Your single-model abstraction silently bifurcates across continents, and the eval suite is the last place that will tell you.
Once a customer's data is in the loss function, deletion stops being a row operation and becomes a rebuild. Lineage chains, four policy choices, and the procurement clause that's now a stop-ship issue.
Production sampling configs accumulate undocumented temperature, top-p, and penalty values whose original justifications evaporate while their effects compound. A discipline for catching it.
Most agent frameworks silently clip tool outputs past a hidden byte or token cap. The model reasons over a fragment it can't see is a fragment, and the bug shows up months later as a customer escalation.
When one model writes the spec, the code, and the tests, 'all tests pass' stops being evidence the feature works — it only proves the model is internally consistent.