Why LLM evals only catch regressions when they live in the PR comment next to the diff. Lessons from how code coverage migrated from nightly job to inline review surface — and the four engineering pieces that turn eval-as-a-job into eval-as-a-merge-gate.
Eval scores climb while user complaints climb with them. An eval set built on launch-week traffic quietly stops measuring the product six months later — here is the shadow-set, resampling, and slicing discipline that keeps the dashboard honest.
Most LLM agent memory collapses four layers into two — a buffer and a vector store. Working, session, episodic, and semantic each need their own tier.
Multi-step agents look fast at the median and feel slow in the tail. Here is why composition punishes P50 dashboards, and how to design latency budgets that match what users actually experience.
Inference is 40-60% of an agent's true cost. The other half hides in vector DB, retrieval embeddings, telemetry, retries, evals, and human review — owned by no single team.
How 'stateless' AI tool calls quietly leak data across tenants through shared caches, vector stores, and memory modules — and the audit protocol that catches it before customers do.
Cookbook-style prompt folders break at scale. Apply monorepo discipline — semantic versioning, dependency graphs, atomic refactors, and eval gates — to keep prompt drift, phantom dependencies, and migration paralysis out of production.
Most production agents treat their tool set as an unordered bag of capabilities. It's actually a partial order, and the bugs live in the edges nobody declared.
Most production agents are background jobs cosplaying as chats. Here's why scheduled triggers, checkpointed state, and bounded envelopes outperform conversational loops on cost, reliability, and operability.
Provider model bumps carry no behavioral compatibility guarantee, so every version change should run through the same staged rollout as a database migration — pinned eval, shadow traffic, canary, and a real rollback path.
Putting 'I don't know' in the system prompt makes abstention untestable, unowned, and unscalable. Move it to the router and you get an SLO, an eval, and a real escalation path.
Agents inherited the broadest OAuth scopes the platform would issue, then drifted on a prompt — bringing back the privileged service account the security org spent a decade killing. A field guide to per-tool scoping, JIT credentials, action-level audit, and the IAM owner who owns the join.