Mocked-tool evals make CI green while production burns. The three assumptions every mock silently makes, why the eval pass rate diverges from the incident rate, and the three-rung ladder (mocks, cassettes, live smoke) that finally closes the gap.
Token spend is one line in a six-line budget. A real decomposition of retrieval, observability, retries, and human review shows why model-swap savings usually lie.
Treating unreleased vendor model capabilities as committed roadmap dependencies turns twelve-month plans into thirty-month rebuilds. A field guide to slip, gate, and re-scope risk — and the discipline of planning against available-today models.
Teams adopt a second LLM provider expecting 2x cost for near-perfect uptime. In production the operational math is 4–5x, correlated failures attenuate the uptime gain, and a well-designed degraded mode on one provider usually wins.
Agents that say 'no results' are rarely making a claim about the world. They are narrating an empty array as if it were proof — and that is how quiet production incidents get manufactured.
OAuth was designed for short requests; agent loops outlive their tokens. Walk through the failure modes, refresh patterns, and credential-lifecycle architecture that hold up at agent timescales.
Fine-tuned adapters pinned to deprecated base models turn into production zombies — load-bearing and unreproducible. A durable adapter lifecycle needs base-model-synced retraining cadence, behavioral fingerprint tests, and institutional memory that survives team changes.
Mid-stream revisions read as incompetence even when the final answer is correct. The fix is a plan-first-then-commit protocol, a clear taxonomy of refinement surfaces, and deliberate choices about when to hide thinking.
Fluent, on-topic LLM answers that solve the wrong problem are the hardest bug class in production. A practical playbook for detecting surface-feature overfitting and designing prompts that expose it.
Plan-and-execute agents emit plans that look like contracts but behave like forecasts. Treat plan adherence as an SLI with measurement, enforcement, and bounded re-planning budgets — not a quality nice-to-have you grade once a quarter.
Scoping the tools list at execution time is too late. If the planner sees the full catalog, its refusals, clarifying questions, and reasoning trace leak capability existence to users who aren't authorized to know.
Why a few chunks dominate every RAG query — how high-dimensional hubness and ANN graph structure silently collapse retrieval diversity, and the diagnostics plus mitigations that keep the long tail alive.