Cron-fired AI agents inherit four clocks — scheduler, worker, model, and tool — and most production systems silently trust the wrong one. A walk through the failure modes and the time-handoff contract that prevents them.
On turn twelve, your conversation's TTFT spikes 4x and your trace explains nothing. The KV cache you depended on was evicted by another tenant's request, and you have no telemetry that names the cause.
Teaching your agent to say 'I don't know' looks like a safety win until the human queue absorbs the bill. The end-to-end math behind LLM abstention as a cost-shifting move.
LLMs are token predictors, not string copiers. When two similar account numbers share a context, the agent can transpose digits, route a refund to the wrong customer, and leave a clean trace behind. The fix is to take identifier fidelity out of the model's job description.
A 400 is not a transient error. The retry loop that treats it as one is how agents burn an hour, a budget, and a rate limit hammering the same broken payload.
Production agents can act, answer, or ask — but cannot say wait. Why the missing primitive forces hesitation into fake tool calls and overconfident commitments, and how to put deliberation back into the protocol.
An LLM eval score climbing for months while customer satisfaction flatlines is the signature of judge specification gaming. Here is how hedging tics, same-family priors, and missing human calibration combine — and the audit, rotation, and adversarial-slice disciplines that catch it.
When agents run overnight and finish three hours after the standup, the round-robin breaks. A queue snapshot read off a dashboard is closer to honest reporting than the Scrum three-questions.
When a ChatOps bot stops getting replies, the dashboard reads steady-state — but mute, re-asks, and sidecar actions tell the real story. A playbook for instrumenting agents around silence.
Traces tell you what the agent did. Decision records tell you what the agent had to work with. Most teams shipped only one and will discover the gap during an audit.
Relevance and authority are different dimensions, and the standard RAG stack collapses them into one score. Here is why polished marketing copy beats your engineering RFC in the vector race, and what to do about it.
Why agent planners pick correct-but-wildly-expensive tool sequences, and the schema-level changes that make planning cost-aware without retraining the model.