Practical guides on building autonomous AI systems, scaling engineering teams, and technical leadership.
Putting 'I don't know' in the system prompt makes abstention untestable, unowned, and unscalable. Move it to the router and you get an SLO, an eval, and a real escalation path.
Agents inherited the broadest OAuth scopes the platform would issue, then drifted on a prompt — bringing back the privileged service account the security org spent a decade killing. A field guide to per-tool scoping, JIT credentials, action-level audit, and the IAM owner who owns the join.
Most production agents have a degraded-mode spec — it just lives in scattered catch blocks, untested, and the customer writes the public version of it on the next bad day.
Agent runtimes hide state in places your DR runbook never named. The fix: name the state surface, generate idempotency keys at task scope, checkpoint before every tool call, and default to fail-safe abort over fail-forward replay.
When an agent issues a wrong refund, your CRO will ask what produced it — and the answer requires a captured-at-write-time tuple of prompt, model id, decode config, tool results, and conversation history. Here is the discipline that makes 'we can reconstruct it' a true statement.
AI threat models usually stop at the model and treat output as safe content. Indirect prompt injection turns rendered markdown, structured output, generated code, and tool-call arguments into attack payloads — and the boundary worth defending is downstream of the model.
A permission prompt is a security control with a measurable half-life. Track per-user approval rate, tier friction by blast radius, and stop letting a 100% click-through rate carry your safety story.
Request-level sampling policies break for agent traces. A per-tier policy — always-trace failures, head-sample successes, tail-sample by cost percentile — turns the trace store from a budget hole into an incident-response tool.
A four-line bug fix gets three rounds of code review. A forty-line system-prompt edit ships with a single LGTM. A field guide to closing the discipline gap on AI artifacts before it ships your next regression.
The wow demo was one realization out of thousands the model would generate against the same input. The rollout craters not because polish is missing — because nobody measured variance. Here's the n-of-k sampling, worst-case input library, and distribution-shift checklist that close the gap.
AI features compose through artifacts nobody catalogs — prompt fragments, eval seeds, judge rubrics. When a shared edit lands, three other teams regress and nobody can attribute it. Here's how to draw the graph.
When the prompt changes and the help-center article doesn't, your AI feature's trust contract breaks silently — and the prompt repo can predict the gap.