Skip to main content

4 posts tagged with "llm-tools"

View all tags

The Support Runbook Your Humans Wrote That Your Support Agent Could Not Parse

· 11 min read
Tian Pan
Software Engineer

A senior support engineer at your company opens a ticket the AI agent already closed and finds the agent's summary: "Resolved — confirmed billing in Stripe, escalated to AE per enterprise policy, refunded $48." Every clause is plausible. None of them happened. There is no tool named check_stripe. There is no tool that looks up customer tier. The "AE" the summary mentions does not work the account anymore. The agent did not call any of the tools it claimed; it generated the summary by paraphrasing the same playbook the engineer reads every Monday. The customer is still waiting.

The runbook the agent read was correct. The customer-success team had spent two years tuning it. Senior engineers had used it to onboard juniors. It said exactly what a human would do: if the customer mentions billing, check Stripe; if they're enterprise, ping the AE first; if it's urgent, escalate. The agent's failure was not that it ignored the runbook. The agent's failure was that it parsed the runbook the way a human reader would — by filling in everything the runbook did not explicitly say — and then acted on the fill-in as if it had been written down.

The Feature Store Your Agent Reinvented Badly

· 10 min read
Tian Pan
Software Engineer

Watch a support agent handle one conversation, and count how many times it computes "churn risk." First when it triages the ticket. Again when it decides whether to offer a discount. A third time when it drafts the escalation summary. Each time, it re-reads the raw orders table, re-runs an inline aggregation, and produces a number. The three numbers don't match. Nobody notices, because they were never written down next to each other.

This is feature engineering. The agent is doing it on every turn, in prose, and doing it worse than a pipeline you would have laughed out of code review a decade ago.

The machine learning world already solved this. The solution is called a feature store, and the discipline it enforces — compute a feature once, name it, version it, serve it consistently — is exactly the discipline an agent throws away the moment you hand it a database tool. Your agent didn't avoid building a feature pipeline. It built one. It just built the worst one in the building.

Argument Hallucination Is a Drift Signal, Not a Model Bug

· 10 min read
Tian Pan
Software Engineer

The ticket says "model hallucinated a user ID." The triage label is model-quality. The fix is one more sentence in the system prompt. Six weeks later a different tool starts hallucinating a date format, and the loop runs again. After a year of this, the prompt has grown into a 4,000-token apology for the entire backend, and the team is convinced the model is just unreliable on tool arguments.

The model isn't unreliable. The model is a contract-conformance machine reading the contract you gave it — and the contract you gave it has been quietly drifting away from the contract on the other side of the wire. Most production "argument hallucinations" are not model failures. They are integration tests your tool description is silently failing, surfacing as model output because that is the only place in the stack where the divergence becomes visible.

Tool Schemas Are Prompts, Not API Contracts

· 11 min read
Tian Pan
Software Engineer

The most expensive line in your agent codebase is the one that auto-generates tool schemas from your existing OpenAPI spec. It looks like a clean engineering choice — single source of truth, no duplication, auto-sync on every API change. It is also why your agent picks searchUsersV2 when it should have picked searchUsersV3, fills limit=20 because your spec's example said so, and silently drops the tenant_id because it was buried in the seventh parameter slot.

Nothing about this shows up in unit tests. The schema validates. The endpoint exists. The agent's call is well-formed JSON. And yet the model uses the tool wrong, every time, in ways your QA pipeline never sees because it tests the API, not the agent's reading of the API.

The bug is conceptual. OpenAPI was designed to describe APIs to humans who write SDK code; tool schemas are read by an LLM at every single call as a piece of the prompt. Treating them as the same artifact is the same category mistake as auto-generating user-facing copy from your database column names.