The Support Runbook Your Humans Wrote That Your Support Agent Could Not Parse
A senior support engineer at your company opens a ticket the AI agent already closed and finds the agent's summary: "Resolved — confirmed billing in Stripe, escalated to AE per enterprise policy, refunded $48." Every clause is plausible. None of them happened. There is no tool named check_stripe. There is no tool that looks up customer tier. The "AE" the summary mentions does not work the account anymore. The agent did not call any of the tools it claimed; it generated the summary by paraphrasing the same playbook the engineer reads every Monday. The customer is still waiting.
The runbook the agent read was correct. The customer-success team had spent two years tuning it. Senior engineers had used it to onboard juniors. It said exactly what a human would do: if the customer mentions billing, check Stripe; if they're enterprise, ping the AE first; if it's urgent, escalate. The agent's failure was not that it ignored the runbook. The agent's failure was that it parsed the runbook the way a human reader would — by filling in everything the runbook did not explicitly say — and then acted on the fill-in as if it had been written down.
This is the part that breaks most production support agents in 2026: not the model, not the tools, not the data — the docs. Customer-success teams have fifty years of accumulated playbooks written for one audience. The agent platform team pipes those same playbooks into a model that reads them as a different audience. Neither team owns the translation. The mistranslation ships in every ticket.
Human runbooks are written in implicit-knowledge shorthand
"Check Stripe" is not an instruction. It is a pointer to a procedure that lives in the senior engineer's head: open Stripe, search by the customer's company name (not the user's email, because the company often signs the contract under a different legal entity), scan the last 90 days of invoices, look for disputed charges, look for failed-card retries, cross-reference against the support ticket's timestamps. A junior picks up the same procedure within their first weeks on the team and soon stops noticing they are doing it. A senior who wrote the runbook five years ago wrote "check Stripe" because anything more granular would have been condescending to the reader they pictured.
The agent reads the same two words and synthesizes a tool call. If the catalog has a tool named lookup_invoices_by_customer, the agent may invoke it; if it doesn't, there is a real chance the agent invents one. This behavior has been reported across multiple agent frameworks: when the prompt mentions a domain action without naming the exact tool, the model hallucinates a tool name derived from the prose (check_stripe, lookup_user_tier) and proceeds as if the call returned data. The downstream summary is then written as if the hallucinated call succeeded.
The same pattern shows up for decision predicates. "If they're enterprise" is a predicate the senior engineer evaluates by glancing at the CRM's tier badge. The runbook does not say "call get_customer_tier(account_id) and branch on the string enterprise." The agent has three options: hallucinate the call, skip the predicate, or guess the answer from prose context (the customer's email domain, the ticket's wording). All three are wrong in the same way — they look like the right step in the summary but are not grounded in any tool output.
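One runtime guard against the invented-tool failure described above is to refuse any call whose name is not in the catalog, so a hallucinated name surfaces as an error instead of flowing into the summary. The sketch below is illustrative only: the catalog shape, the `dispatch` helper, and the stub tool are assumptions, not any particular framework's API.

```python
from typing import Any, Callable


class UnknownToolError(Exception):
    """Raised when the model requests a tool that is not in the catalog."""


# Hypothetical catalog shape: tool name -> callable. Real catalogs will differ.
ToolCatalog = dict[str, Callable[..., Any]]


def dispatch(catalog: ToolCatalog, name: str, **kwargs: Any) -> Any:
    """Run a tool only if its name appears verbatim in the catalog."""
    if name not in catalog:
        # Fail loudly instead of letting the agent narrate a call that never ran.
        raise UnknownToolError(f"model requested unknown tool: {name!r}")
    return catalog[name](**kwargs)


def lookup_invoices_by_customer(customer_id: str) -> list[dict]:
    # Stub standing in for a real billing lookup.
    return [{"id": "in_001", "customer": customer_id, "status": "paid"}]


CATALOG: ToolCatalog = {"lookup_invoices_by_customer": lookup_invoices_by_customer}
```

With this in place, `dispatch(CATALOG, "check_stripe", customer_id="cus_1")` raises `UnknownToolError` rather than returning plausible-looking data. The guard covers invented tool names only; a skipped or guessed predicate needs the authoring-time discipline discussed later.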
Runbooks are now a multi-audience artifact
The architectural fact the support team has not absorbed yet is that the runbook is no longer a one-reader document. The same markdown file is parsed by:
- Senior engineers, who skim for the relevant section and fill in the rest from memory.
- Junior engineers, who follow it sentence-by-sentence and ask in Slack when something is unclear.
- The support agent, which executes it as a program with no out-of-band channel to ask clarifying questions.
- The internal observability tooling, which indexes it to surface "did we follow the playbook?" dashboards.
- Sometimes a third-party integrator embedding your support flows via API.
Each reader has a different tolerance for ambiguity. The senior engineer tolerates the most; the agent tolerates the least. The author optimized for the reader they pictured — almost always the human — and the audience that reads the doc tonight is the one that cannot fill in the gaps. Treating the runbook as a single document with a single audience is the source of the mismatch.
The fix is not to rewrite every runbook for the agent. That would make the docs unreadable for the humans who still own the procedures. The fix is to recognize that the document has two surfaces and to give each surface its own section.
The executable section: tools, arguments, predicates, no implicit steps
Every runbook that an agent will execute needs an explicit executable section that the human-prose section cannot replace. The executable section names tools that exist in the catalog. It enumerates the arguments those tools require and where each argument comes from in the current context. It writes decision predicates as comparisons against tool outputs, not as natural-language adjectives. It contains no implicit steps — every step is grounded in a tool call whose name appears verbatim in the agent's tool list.
A "check Stripe" line becomes, in the executable section, something close to: invoke stripe.invoice_list with the customer_id returned by crm.find_customer(account_name=ticket.account_name), filter to invoices with status in {past_due, disputed} in the last 90 days, return the list to the next step. An "if they're enterprise" predicate becomes: invoke crm.get_account(account_id) and branch on account.tier == "enterprise". The point is not that the prose is uglier; the point is that the prose has no room for implicit context. Either the step is grounded or it is not in the executable section.
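To make the grounded version concrete, here is a minimal sketch of those two steps as code. The three functions are stubs named after the tools mentioned above; their return shapes, the `account_id` field on the customer record, and the fixed clock are assumptions made for the example, and none of this is the real Stripe or CRM API.

```python
from datetime import datetime, timedelta, timezone

NOW = datetime(2026, 5, 1, tzinfo=timezone.utc)  # fixed clock so the example is deterministic


# Stubs for the tools named in the text. Return shapes are assumptions.
def crm_find_customer(account_name: str) -> dict:
    return {"customer_id": "cus_123", "account_id": "acct_9"}


def crm_get_account(account_id: str) -> dict:
    return {"account_id": account_id, "tier": "enterprise"}


def stripe_invoice_list(customer_id: str) -> list[dict]:
    return [
        {"id": "in_1", "status": "paid", "created": NOW - timedelta(days=10)},
        {"id": "in_2", "status": "disputed", "created": NOW - timedelta(days=30)},
        {"id": "in_3", "status": "past_due", "created": NOW - timedelta(days=200)},
    ]


def check_billing(ticket: dict, now: datetime = NOW) -> dict:
    """Every value used below comes from the ticket or from a prior tool output."""
    customer = crm_find_customer(account_name=ticket["account_name"])
    invoices = stripe_invoice_list(customer_id=customer["customer_id"])
    cutoff = now - timedelta(days=90)
    flagged = [
        inv for inv in invoices
        if inv["status"] in {"past_due", "disputed"} and inv["created"] >= cutoff
    ]
    account = crm_get_account(account_id=customer["account_id"])
    # The predicate is a comparison against a tool output, not an adjective.
    return {"flagged_invoices": flagged, "is_enterprise": account["tier"] == "enterprise"}


result = check_billing({"account_name": "Acme GmbH"})
```

With the stub data, only the disputed invoice from 30 days ago is flagged; the paid invoice fails the status filter and the 200-day-old one fails the 90-day window.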
The human-prose section stays — it is what the senior engineer reads, what new hires onboard against, and what the team uses to debate when the playbook is wrong. But the two sections are explicitly labeled, the agent's instruction template loads only the executable section into the prompt, and the human-prose section is loaded into a separate context block labeled as background that the agent is instructed not to translate into tool calls. The instruction discipline is straightforward: prose explains why, the executable section says what, and the agent is only allowed to act from the what.
This is the same shape Addy Osmani and others have been pushing for in agent specs more broadly — that a spec for an agent has to be structured like a design document with explicit toolchain entry points, not a wiki page written in continuous prose. Customer-support runbooks happen to be the place where the cost of skipping that discipline shows up first, because the runbook is older than the agent and was written by people who never imagined a non-human reader.
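The two-surface loading rule described above can be sketched as a small splitter that routes each labeled section to its own context block. The section markers (`## Executable`, `## Background`), the function names, and the wording of the background label are all hypothetical choices for illustration.

```python
# Hypothetical section markers; any unambiguous labels would do.
EXEC_MARKER = "## Executable"
BACKGROUND_MARKER = "## Background"


def split_runbook(text: str) -> dict[str, str]:
    """Split a runbook into its two labeled sections."""
    sections: dict[str, list[str]] = {"executable": [], "background": []}
    current = None
    for line in text.splitlines():
        marker = line.strip()
        if marker == EXEC_MARKER:
            current = "executable"
        elif marker == BACKGROUND_MARKER:
            current = "background"
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


def build_context(runbook_text: str) -> dict[str, str]:
    """Only the executable section becomes instructions; prose is labeled background."""
    parts = split_runbook(runbook_text)
    if not parts["executable"]:
        raise ValueError("runbook has no executable section; refusing to load it")
    return {
        "instructions": parts["executable"],
        "background": (
            "BACKGROUND ONLY. Do not translate anything in this block into tool calls.\n"
            + parts["background"]
        ),
    }


RUNBOOK = """\
## Background
If the customer mentions billing, check Stripe. Enterprise accounts go to the AE first.

## Executable
1. crm.find_customer(account_name=ticket.account_name)
2. stripe.invoice_list(customer_id=steps[1].customer_id)
"""

context = build_context(RUNBOOK)
```

Refusing to load a runbook with an empty executable section is a deliberate choice in this sketch: falling back to the prose would reintroduce exactly the implicit steps the split is meant to remove.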
Lint the executable section against the live tool catalog
The executable section is now code, and like all code it drifts. A tool gets renamed when the platform team standardizes API surfaces. An argument that used to default to the requesting agent's identity now needs to be passed explicitly. A new tool is added that replaces three older ones. The runbook does not change, because the human who wrote it does not read the tool-catalog changelog.
The mechanism that closes this gap is a linter that runs against every runbook at merge time and on a schedule. The linter parses each executable section, extracts every tool name and every argument reference, and diffs them against the live tool catalog the agent will use at runtime. Any named tool that does not exist in the catalog fails the lint. Any required argument that is not sourced from a prior step's output, the ticket payload, or a typed input fails the lint. Any predicate that compares a tool output against a value the tool's schema cannot produce fails the lint.
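A minimal sketch of the three lint rules above, assuming the executable section has already been parsed into a list of step dictionaries. The catalog schema (`required`, `enums`), the source-reference syntax (`ticket.`, `input.`, `steps.<id>.`), and the predicate shape are invented for the example; a real linter would read whatever schema the live catalog exposes.

```python
# Hypothetical catalog snapshot: required argument names and enum-valued outputs.
CATALOG = {
    "crm.find_customer": {"required": {"account_name"}, "enums": {}},
    "crm.get_account": {
        "required": {"account_id"},
        "enums": {"tier": {"free", "pro", "enterprise"}},
    },
    "stripe.invoice_list": {"required": {"customer_id"}, "enums": {}},
}


def lint(steps: list[dict], catalog: dict) -> list[str]:
    errors: list[str] = []
    seen_ids: set[str] = set()
    for step in steps:
        sid, tool = step["id"], step["tool"]
        spec = catalog.get(tool)
        if spec is None:
            # Rule 1: every named tool must exist in the catalog.
            errors.append(f"{sid}: unknown tool {tool!r}")
            seen_ids.add(sid)
            continue
        args = step.get("args", {})
        # Rule 2: every required argument is present and has a known source.
        for name in sorted(spec["required"] - set(args)):
            errors.append(f"{sid}: missing required argument {name!r}")
        for name, source in args.items():
            if source.startswith("steps."):
                ref = source.split(".")[1]
                if ref not in seen_ids:
                    errors.append(f"{sid}: argument {name!r} references unknown prior step {ref!r}")
            elif not source.startswith(("ticket.", "input.")):
                errors.append(f"{sid}: argument {name!r} has unsourced value {source!r}")
        # Rule 3: a predicate may only compare against values the schema can produce.
        pred = step.get("predicate")
        if pred is not None:
            allowed = spec["enums"].get(pred["field"])
            if allowed is not None and pred["equals"] not in allowed:
                errors.append(f"{sid}: predicate value {pred['equals']!r} is not a possible {pred['field']!r}")
        seen_ids.add(sid)
    return errors


GOOD = [
    {"id": "find", "tool": "crm.find_customer", "args": {"account_name": "ticket.account_name"}},
    {"id": "account", "tool": "crm.get_account", "args": {"account_id": "steps.find.account_id"},
     "predicate": {"field": "tier", "equals": "enterprise"}},
]
BAD = [
    {"id": "billing", "tool": "check_stripe", "args": {}},
    {"id": "account", "tool": "crm.get_account", "args": {"account_id": "steps.lookup.account_id"},
     "predicate": {"field": "tier", "equals": "platinum"}},
    {"id": "invoices", "tool": "stripe.invoice_list", "args": {}},
]
```

Here `lint(GOOD, CATALOG)` returns an empty list, while `lint(BAD, CATALOG)` reports the invented tool, the dangling step reference, the impossible tier value, and the missing `customer_id`. The sketch only checks predicates on fields the catalog declares as enums; other fields pass unchecked.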
A runbook with a failing lint cannot be the active version the agent loads. The team that owns the runbook can either fix the executable section against the new catalog or pin the runbook to the old catalog version, but no team is allowed to ship a runbook the agent cannot mechanically follow. This is the same discipline that schema-validating tool inputs at runtime gives you, applied at authoring time, so the failure is caught when the customer-success engineer hits save, not when a customer's ticket is half-resolved.
Replay old tickets as the agent's regression set
Even with a clean linted runbook, the agent will sometimes pick a step that looks right in isolation but disagrees with how a senior support engineer would actually have handled the ticket. The mechanism that catches this is the same pattern that has become standard for evaluating agents generally: a regression set of past tickets with known good resolutions, replayed through the agent against the current runbook on every change.
The set does not need to be enormous. Twenty to fifty replay traces, drawn from real tickets that senior engineers resolved correctly, are enough to catch most regressions. The scoring rubric is not "did the agent produce the same words as the human" but "did the agent pick the same tool sequence at the same decision points." A ticket where the senior engineer called crm.get_account then stripe.invoice_list then escalated to a named AE should produce, in replay, an agent trace that does the same three steps in the same order. Divergences are surfaced as eval failures and inspected: sometimes the agent is right and the human was lazy, sometimes the runbook drifted, sometimes the agent invented a step.
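The tool-sequence rubric can be sketched as a comparison that reports the first step where the agent's trace departs from the golden one. The ticket IDs, the `escalate.to_account_executive` tool name, and the `fake_agent` stand-in are hypothetical; a real harness would call the agent against the current runbook and record its actual tool calls.

```python
from typing import Callable, Optional


def first_divergence(golden: list[str], actual: list[str]) -> Optional[int]:
    """Index of the first step where the traces differ, or None if they match."""
    for index, (expected, got) in enumerate(zip(golden, actual)):
        if expected != got:
            return index
    if len(golden) != len(actual):
        # One trace is a strict prefix of the other.
        return min(len(golden), len(actual))
    return None


def replay(goldens: dict[str, list[str]], run_agent: Callable[[str], list[str]]) -> list[dict]:
    """Replay every golden ticket and collect divergences for human inspection."""
    failures = []
    for ticket_id, golden in goldens.items():
        actual = run_agent(ticket_id)
        index = first_divergence(golden, actual)
        if index is not None:
            failures.append({
                "ticket": ticket_id,
                "step": index,
                "expected": golden[index] if index < len(golden) else None,
                "got": actual[index] if index < len(actual) else None,
            })
    return failures


GOLDENS = {
    "T-1001": ["crm.get_account", "stripe.invoice_list", "escalate.to_account_executive"],
    "T-1002": ["crm.get_account", "stripe.invoice_list"],
}


def fake_agent(ticket_id: str) -> list[str]:
    # Stand-in for a real agent run; T-1002 skips the tier lookup.
    traces = {
        "T-1001": ["crm.get_account", "stripe.invoice_list", "escalate.to_account_executive"],
        "T-1002": ["stripe.invoice_list"],
    }
    return traces[ticket_id]


failures = replay(GOLDENS, fake_agent)
```

In this run only T-1002 fails, at step 0, because the agent went straight to invoices without reading the account tier. The report says where the traces diverged, not who was right; that judgment stays with the reviewer.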
This eval has to live somewhere the support team and the agent platform team both look at. If only the platform team watches it, the support team will keep editing the runbook in ways that fail the eval and not understand why their merge was blocked. If only the support team watches it, the platform team will ship model upgrades that quietly invalidate half the goldens. The eval is the contract between the two teams.
A point that the Mercadona-style golden-set discipline makes well: the eval set must be immutable relative to model output. If you regenerate goldens from the agent's current behavior every time you change the model, the eval becomes a measurement of self-consistency, not correctness. The goldens come from the humans who actually resolved the tickets correctly, and they stay frozen until a human reviews and refreshes them.
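One way to enforce that freeze mechanically, offered as a sketch rather than an established practice from the sources: record a digest of the golden set when a human approves it, and fail the eval run if the set on disk no longer matches. The function names and the in-memory golden set are assumptions for the example.

```python
import hashlib
import json


class GoldensChangedError(Exception):
    """Raised when the golden set differs from the human-approved version."""


def fingerprint(goldens: dict) -> str:
    """Stable SHA-256 digest of the golden set, independent of key order."""
    canonical = json.dumps(goldens, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_frozen(goldens: dict, approved_digest: str) -> None:
    """Refuse to run the eval against goldens nobody reviewed."""
    actual = fingerprint(goldens)
    if actual != approved_digest:
        raise GoldensChangedError(
            f"golden set changed since last human review: {actual} != {approved_digest}"
        )


goldens = {"T-1001": ["crm.get_account", "stripe.invoice_list"]}
approved = fingerprint(goldens)  # recorded once, at human review time
assert_frozen(goldens, approved)  # passes: nothing has changed since approval
```

Any pipeline that regenerates goldens from model output would change the digest and trip the check, which forces the refresh through a human review step.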
The org failure that the playbook diff makes visible
The diff between the human prose and the executable section is not just a doc artifact. It is the most honest map of who owns what in your support stack. Every place where the human prose is rich and the executable section is empty is a place where a human is doing implicit work the agent platform has not encoded. Every place where the executable section is rich and the human prose is thin is a place where the agent has capabilities the support team does not know about and is not training new hires on.
The teams that ship support agents without this diff end up in the same place: the support team blames the agent platform for an agent that does the wrong thing, the agent platform team blames the support team for runbooks that "aren't structured properly," and the customer waits while the post-mortem identifies a translation no one owned. The teams that ship working agents recognize that the runbook is now a contract — readable by both audiences, lintable against the tool catalog, replayable against past tickets — and that the contract has to be owned by both sides or it stops being a contract at all.
The architectural realization underneath all of this is straightforward. Runbooks used to be advisory documents for humans who could fill in the gaps. They are now executable artifacts whose readers include systems that cannot. The team that did not rewrite the playbook for the new reader is paying the cost of a mistranslation in every ticket: sometimes as a wrong answer, sometimes as a confidently wrong summary that a human then has to undo, and either way as a tax on the customer's trust that compounds until someone notices the playbook has been talking past the agent for months.
- https://earezki.com/ai-news/2026-05-17-the-runbook-is-already-lying-to-you/
- https://incident.io/blog/runbook-automation-tools-2026-the-complete-guide
- https://cast.ai/blog/agentic-runbooks/
- https://cutover.com/blog/beyond-automation-agentic-ai-importance-runbook-task-models
- https://decagon.ai/blog/ai-chatbot-challenges
- https://www.bland.ai/blogs/ai-agent-challenge-in-customer-support
- https://manveerc.substack.com/p/ai-agent-hallucinations-prevention
- https://www.roborhythms.com/fix-agent-tool-hallucinations-4-section-prompt/
- https://github.com/SolaceLabs/solace-agent-mesh/issues/1261
- https://arxiv.org/pdf/2507.15296
- https://dust.tt/blog/how-to-write-ai-agent-instructions
- https://www.oreilly.com/radar/how-to-write-a-good-spec-for-ai-agents/
- https://document360.com/blog/documentation-for-humans-and-ai-agents/
- https://www.alation.com/blog/how-to-write-ai-ready-documentation/
- https://medium.com/@Praxen/agent-evaluation-suites-that-actually-catch-failures-02f4e9ab0243
- https://www.sentrial.com/blog/ai-agent-regression-testing-that-catches-silent-failures
- https://medium.com/@federicomoreno613/golden-datasets-the-foundation-of-reliable-ai-evaluation-486ce97ce89d
- https://www.cnbc.com/2026/04/01/ai-chatbot-customer-service-complaints-refunds.html
- https://masterofcode.com/blog/ai-in-customer-service-statistics
