Cloud AI stacks treat outbound HTTPS as a free primitive. Pulling the cable forces every layer — model provenance, evals, fleet, telemetry — to grow primitives the cloud version quietly hides.
Provider availability is continuous, not binary. Your fallback chain handles the easy outage and misses the brownout that quietly drains user trust for hours.
Most agents either over-ask and exhaust users or over-guess and lose their trust. The fix is a per-task clarification budget plus a policy layer the model is structurally unqualified to own.
The embedding model sets the upper bound on RAG quality, and swapping LLMs can't move it. A practical framework for choosing one: domain match, dimensionality, multilingual behavior, and instruction tuning.
LLM evals catch regressions only when they live in the PR comment next to the diff. Lessons from how code coverage migrated from a nightly job to an inline review surface, and the four engineering pieces that turn eval-as-a-job into eval-as-a-merge-gate.
Eval scores climb while user complaints climb with them. An eval set built on launch-week traffic quietly stops measuring the product six months later — here is the shadow-set, resampling, and slicing discipline that keeps the dashboard honest.
Most LLM agent memory collapses four layers into two — a buffer and a vector store. Working, session, episodic, and semantic each need their own tier.
Multi-step agents look fast at the median and feel slow in the tail. Here is why composition punishes P50 dashboards, and how to design latency budgets that match what users actually experience.
Inference is 40-60% of an agent's true cost. The rest hides in vector DB, retrieval embeddings, telemetry, retries, evals, and human review, and no single team owns it.
How 'stateless' AI tool calls quietly leak data across tenants through shared caches, vector stores, and memory modules — and the audit protocol that catches it before customers do.
Cookbook-style prompt folders break at scale. Apply monorepo discipline — semantic versioning, dependency graphs, atomic refactors, and eval gates — to keep prompt drift, phantom dependencies, and migration paralysis out of production.
Most production agents treat their tool set as an unordered bag of capabilities. It's actually a partial order, and the bugs live in the edges nobody declared.