Production agents confidently confirm actions that never happened because teams treat the chat text, rather than the tool call, as the contract. A pattern for separating narration from commitment.
When a smarter model disagrees with the one you shipped, every durable agent decision becomes a contested record. A framework for eval, decision, and action replay — plus the architectural prerequisites and policy matrix you need before the next upgrade.
Model upgrades raise your aggregate pass rate while concentrating the residual failures on the hardest 5% of traffic — here's how stratified evals and capability-frontier probing expose the cliff before it lands in your on-call rotation.
Tool-level idempotency keys are not enough when a non-deterministic planner can re-emit the same action. The contract has to live at the orchestration boundary, keyed by structural run state — not by model-authored arguments.
Agent latency is a nested tree of planning calls, tool fan-outs, and sub-agents. Flame graphs sorted by duration hide the critical path, so local optimizations miss the real budget violation. Here's how to budget, propagate deadlines, and observe slack across the whole tree.
Agent memory has two schemas — the store and the model's context — and only one of them migrates with a SQL script. Why protobuf's additive-only discipline is the right starting point, and what the shadow-write playbook needs on top.
Agents fail by continuing to talk. Confident prose papers over tool errors while writes never commit. The fix: demote the model's claim to a hypothesis, promote tool responses and post-action probes to authoritative signals, and measure effect landing instead of turn success.
Granting an agent PagerDuty access is an infra decision with product-team consequences. A control plane for human-facing tools — rate limits, dry-run, off-ramps — that prompts can't enforce.
Chat logs are ESI (electronically stored information). Design retention in four tiers, build a hold registry before you need it, and tag provenance at ingestion, or pay for the same architecture in the middle of discovery.
Technical roles saw 48% AI-assisted cheating across 19,368 interviews, and 61% of cheaters cleared the bar. A look at why detection cannot win, why no-AI policies punish honest candidates, and the interview formats replacing the broken ones.
Hosted tracing SDKs quietly ship full prompts and responses past your trust boundary. A compliance playbook for LLM teams: classify fields, scrub before egress, audit the SDK as policy.
Most struggling AI teams run frontier models on 2012-era operations. The next hire who fixes it is usually an SRE, not another applied scientist.