A single autonomy switch is the wrong abstraction for coding agents. Map each tool to a blast-radius tier, scale gates to the tier, and match agent velocity to your rollback velocity.
When a vendor renames a tool response field, your agent doesn't crash — it adapts and ships a degraded answer. Why microservices contract testing has to migrate to the agent stack, and how to wire it in.
Production LLM logs answer 'what did the model say' well and 'what did the model see' poorly — and that gap is what breaks model-migration evaluations months later. A practical schema for replayable traces.
Agent prompts and agent tools look like the same asset on disk, but they fail in completely different ways — and shipping them through one pipeline is the architectural mistake at the root of most agent incidents.
Temperature is not a global knob. Allocate randomness per surface — routing, parsing, synthesis, generation, exploration — or pay for variance you do not need.
Engineering orgs are quietly accumulating a second audience for their internal docs — every developer's AI assistant. The team that writes for only one of those readers is shipping the other broken.
Swapping an embedding model in production is not a batch job — it's a schema migration with semantic consequences. Why pointwise evals miss the regression, what dual-write windows and neighborhood-stability metrics actually buy you, and where the cost frame surprises teams.
An eval suite is not a measurement of your model — it is a frozen portrait of whoever wrote it. Audit, rotate, and de-monoculture your benchmark before green CI becomes a self-flattering lie.
Prompt rewrites are the easy part of switching LLM providers. The eval harness is where the real lock-in lives — and the bill comes due the day you try to renegotiate.
Eval suites measure a quiet machine running serial calls against warm caches; production is a different system. Treat latency as a property of a deployment, not a model, or your p95 will lie.
The 47-criterion rubric your engineers wrote to make the LLM-as-judge work has quietly become your product specification. Every weight, every score boundary, every missing criterion is a product decision the PM never made on the record.
An LLM eval suite is a simulator. Skip the recalibration cycle and you ship six green releases against a dataset that stopped resembling production around month three.