Agent prompts and agent tools look like the same asset on disk, but they fail in completely different ways — and shipping them through one pipeline is the architectural mistake at the root of most agent incidents.
Temperature is not a global knob. Allocate randomness per surface — routing, parsing, synthesis, generation, exploration — or pay for variance you do not need.
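The per-surface allocation above could be sketched as a policy table; the surface names come from the line itself, but the specific temperature values and the `sampling_params` helper are illustrative assumptions, not a prescribed implementation:

```python
# Hypothetical sketch: one temperature policy per call surface,
# instead of a single global default. Values are illustrative.
SURFACE_TEMPERATURE = {
    "routing": 0.0,      # pick-one decisions want determinism
    "parsing": 0.0,      # structured extraction tolerates no variance
    "synthesis": 0.3,    # light variation when merging sources
    "generation": 0.7,   # user-facing prose benefits from range
    "exploration": 1.0,  # brainstorming deliberately buys variance
}

def sampling_params(surface: str) -> dict:
    """Return the sampling parameters allocated to a given surface."""
    try:
        return {"temperature": SURFACE_TEMPERATURE[surface]}
    except KeyError:
        raise ValueError(f"no temperature policy for surface {surface!r}")
```

The point of the table is that an unlisted surface fails loudly instead of silently inheriting a global default.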
Engineering orgs are quietly accumulating a second audience for their internal docs — every developer's AI assistant. The team that writes for only one of those readers is shipping broken docs to the other.
Swapping an embedding model in production is not a batch job — it's a schema migration with semantic consequences. Why pointwise evals miss the regression, what dual-write windows and neighborhood-stability metrics actually buy you, and where the cost frame surprises teams.
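One way to make "neighborhood stability" concrete is top-k neighbor overlap between the old and new embedding spaces; this is a minimal sketch under that assumption (cosine neighbors, Jaccard-style overlap averaged over the corpus), not the only possible metric:

```python
import numpy as np

def top_k_neighbors(embs: np.ndarray, k: int) -> np.ndarray:
    """Indices of each row's k nearest cosine neighbors, excluding self."""
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)          # never count a point as its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]

def neighborhood_stability(old_embs: np.ndarray,
                           new_embs: np.ndarray,
                           k: int = 10) -> float:
    """Mean fraction of each item's top-k neighbors preserved across the swap.

    1.0 means every item keeps its whole neighborhood; values well below 1.0
    flag a semantic shift that pointwise evals can miss entirely.
    """
    old_nn = top_k_neighbors(old_embs, k)
    new_nn = top_k_neighbors(new_embs, k)
    overlaps = [len(set(o) & set(n)) / k for o, n in zip(old_nn, new_nn)]
    return float(np.mean(overlaps))
```

Run it over embeddings of the same corpus produced by both models during the dual-write window; a pointwise eval can stay green while this number drops, which is exactly the regression the line above warns about.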
An eval suite is not a measurement of your model — it is a frozen portrait of whoever wrote it. Audit, rotate, and de-monoculture your benchmark before green CI becomes a self-flattering lie.
Prompt rewrites are the easy part of switching LLM providers. The eval harness is where the real lock-in lives — and the bill comes due the day you try to renegotiate.
Eval suites measure a quiet machine running serial calls against warm caches; production is a different system. Treat latency as a property of a deployment, not a model, or your p95 will lie.
The 47-criterion rubric your engineers wrote to make the LLM-as-judge work has quietly become your product specification. Every weight, every score boundary, every missing criterion is a product decision the PM never made on the record.
An LLM eval suite is a simulator. Skip the recalibration cycle and you ship six green releases against a dataset that stopped resembling production around month three.
Four eval signals on a prompt change rarely agree. Without a written hierarchy for which signal wins under which conditions, every release week becomes a debate about whose number to trust.
Few-shot examples are tuned to a specific model. After a model upgrade, the demonstrations that lifted accuracy can quietly start dragging it down — here is the audit and provenance discipline that prevents rot.
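The provenance discipline above can be reduced to one rule: every demonstration records which model it was validated against, and anything not re-validated on the current model gets flagged. A minimal sketch, with all field names and the `stale_examples` helper as illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class FewShotExample:
    prompt: str
    completion: str
    tuned_for_model: str   # model this demo was last validated against
    last_validated: str    # ISO date of the last accuracy check

def stale_examples(examples: list, current_model: str) -> list:
    """Flag demonstrations never re-validated on the current model."""
    return [e for e in examples if e.tuned_for_model != current_model]
```

Running the audit on every model upgrade turns "quietly dragging accuracy down" into an explicit re-validation queue.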
Every sufficiently capable model exposes behaviors your team never roadmapped. Users find them, build workflows on top, and treat the next model upgrade as a regression. Here's the product discipline that turns found capabilities into decisions you actually own.