Strict JSON mode quietly shaves reasoning accuracy on many tasks. Here's the decoding-time mechanism, the measured gap across markdown, XML, and JSON, and a decision tree for picking a format that fits the job.
Third-party MCP servers are the new long-tail dependency risk for AI agents. Abandoned maintainers, stale shims, and inherited CVEs create silent failures that bypass every supply chain alert — here's how to spot an orphan before adoption, and when to fork, vendor, or build your own.
Most agent UIs turn every course correction into a full restart. The fix is an architectural one — checkpoint-and-inject, plan revision hooks, and soft-interrupt tokens — plus a three-verb UX vocabulary that separates correction from override from cancellation.
Most AI experiments compare better AI to worse AI and skip the comparison that actually matters — against no AI at all. The null arm is the missing discipline keeping teams from knowing whether their inference spend earns anything.
Mocked-tool evals make CI green while production burns. The three assumptions every mock silently makes, why the eval pass rate diverges from the incident rate, and the three-rung ladder (mocks, cassettes, live smoke) that finally closes the gap.
Token spend is one line in a six-line budget. A real decomposition of retrieval, observability, retries, and human review shows why model-swap savings usually lie.
Treating unreleased vendor model capabilities as committed roadmap dependencies turns twelve-month plans into thirty-month rebuilds. A field guide to slip, gate, and re-scope risk — and the discipline of planning against available-today models.
Teams adopt a second LLM provider expecting 2x cost for near-perfect uptime. In production the operational math is 4–5x, correlated failures attenuate the uptime gain, and a well-designed degraded mode on one provider usually wins.
Agents that say 'no results' are rarely making a claim about the world. They are narrating an empty array as if it were proof — and that is how quiet production incidents get manufactured.
OAuth was designed for short requests; agent loops outlive their tokens. Walk through the failure modes, refresh patterns, and credential-lifecycle architecture that hold up at agent timescales.
Fine-tuned adapters pinned to deprecated base models turn into production zombies — load-bearing and unreproducible. A durable adapter lifecycle needs base-model-synced retraining cadence, behavioral fingerprint tests, and institutional memory that survives team changes.
Mid-stream revisions read as incompetence even when the final answer is correct. The fix is a plan-first-then-commit protocol, a clear taxonomy of refinement surfaces, and deliberate choices about when to hide thinking.