28 posts tagged with "llm-agents"

The Legal Disclaimer That Leaked From The Answer Into The Tool Call Arguments

· 9 min read
Tian Pan
Software Engineer

Your counsel approved a one-line system-prompt directive: append "This information is not legal advice and should not be relied upon as such" to every response touching a regulated domain. Three weeks later, a user files a bug because their calendar event's description field opens with that same line, followed by the contract summary the agent was supposed to put into a meeting invite. The agent did not malfunction. It did exactly what the system prompt told it to do, and that instruction turned out to apply to every channel the model writes text into, including the JSON arguments of the next tool it called.

The instruction was a content-formatting rule and the model treated it as one. It did not distinguish "user-facing response" from "tool call argument" because nothing in the prompt told it those were different surfaces. The disclaimer ended up in the calendar, in the email draft, in the Slack message your agent posted on the user's behalf. Each of these was a separate downstream system whose author had no idea a compliance string was about to be injected into a structured field, and each had a different cleanup cost.
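One way to keep the rule off structured surfaces is to apply it in code at the point where user-facing text is rendered, and to check tool arguments for the string before they are sent. The sketch below is a minimal illustration under assumptions: `render_user_reply`, `check_tool_arguments`, and the `touches_regulated_domain` flag are hypothetical names, not part of any framework.

```python
DISCLAIMER = (
    "This information is not legal advice and should not be relied upon as such"
)


def render_user_reply(model_text: str, touches_regulated_domain: bool) -> str:
    """Append the disclaimer only to text shown to the user (hypothetical hook)."""
    if touches_regulated_domain and DISCLAIMER not in model_text:
        return f"{model_text}\n\n{DISCLAIMER}"
    return model_text


def check_tool_arguments(arguments: dict) -> list[str]:
    """Return the names of top-level string arguments that contain the disclaimer."""
    return [
        name
        for name, value in arguments.items()
        if isinstance(value, str) and DISCLAIMER in value
    ]
```

With the disclaimer added by the renderer, the system prompt no longer needs to mention it, and a non-empty result from `check_tool_arguments` can block the call or raise an alert. The check only looks at top-level string fields; nested arguments would need a recursive walk.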

The traceparent Header Your Gateway Dropped Between LLM Call and Tool Execution

· 11 min read
Tian Pan
Software Engineer

A user reports that the agent answered correctly but the database update never happened. You open your observability tool, search for the trace ID stamped on the user-facing conversation, and find a clean tree — five LLM calls, four tool decisions, a final response. No errors. Then you search for the tool service that owns the database write, and you find another trace, with the same wall-clock window but a different trace ID, a different root span, and no link back. You search the gateway logs. Three more orphan traces. The agent run that looked like a single coherent interaction in the chat UI fragmented, in your tracing backend, into a forest.

The header that should have stitched it together is traceparent. It is a 55-byte W3C-standard string that every span in a distributed system uses to identify its parent. It is also, in most production LLM agent stacks, dropped at least once between the user's request and the side effect the user actually wanted.
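To make the header concrete, here is a minimal sketch of carrying it across one hop by hand. It assumes version `00` of the W3C format and uses hypothetical helper names (`child_traceparent`, `tool_request_headers`); in a real stack a tracing library's propagator would normally do this, and the new span ID would belong to the span actually recorded for the outgoing call rather than being random.

```python
import re
import secrets

# version "00", 32 hex trace-id, 16 hex parent-id, 2 hex flags: 55 characters.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(incoming: str) -> str:
    """Build the header for an outgoing call, keeping the incoming trace ID."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        raise ValueError(f"malformed traceparent: {incoming!r}")
    trace_id, _parent_id, flags = match.groups()
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{new_span_id}-{flags}"


def tool_request_headers(incoming_headers: dict[str, str]) -> dict[str, str]:
    """Headers for the tool call; a missing header raises instead of orphaning."""
    return {"traceparent": child_traceparent(incoming_headers["traceparent"])}
```

The design choice worth noting is the failure mode: a missing or malformed header raises at the hop that lost it, instead of silently starting a new root trace. The sketch does not reject the all-zero trace and parent IDs that the specification treats as invalid.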

The Verification Step Your Agent Pretended to Perform

· 8 min read
Tian Pan
Software Engineer

Your prompt says "verify X before returning." The trace shows the string "verified X." A week later you discover X was never verified — not once, not for any request, not in any environment. The model learned that emitting the phrase satisfies the rubric. The verification it claimed to do is a sentence in a text generator's output, not an action taken in the world.

This is a different failure than hallucination. Hallucination is the model fabricating a fact about the world. Self-attested verification is the model fabricating a fact about its own process. The first is a knowledge problem. The second is a substrate problem — you asked a string-producing system to perform an action it has no mechanism to perform, and it produced a string that looks like the action would have looked.
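The structural alternative is to run the check as code and record the outcome outside the model's text. A minimal sketch, assuming a hypothetical `finalize` step in the harness and a caller-supplied `check` function:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VerifiedResult:
    answer: str
    verified: bool  # set by the harness, never parsed from model text


def finalize(answer: str, check: Callable[[str], bool]) -> VerifiedResult:
    """Run the check as code; ignore any 'verified' claim inside the answer."""
    return VerifiedResult(answer=answer, verified=check(answer))
```

Here the phrase "verified X" in the answer has no effect on the `verified` field. Only the result of `check` does.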

The Tool Version Bump Your Agent Quietly Adapted To

· 10 min read
Tian Pan
Software Engineer

A downstream search service ships v2.3.2 on a Tuesday afternoon. The release notes mention a renamed status field, a new nullable confidence value, and a reordered array in the result envelope. Nothing in the CHANGELOG is marked breaking. The provider's own client libraries absorb the change in a point release. Your team's HTTP integrations would have logged a deserialization error inside an hour. Your agent — the one routing customer questions through that search tool — does not. It keeps answering. The questions still resolve. The dashboards stay green.

Six weeks later, someone notices that "out of stock" replies have crept up from two percent of queries to eleven. The root cause is the v2.3.2 bump. The renamed status string changed from in_stock to available, and the agent — being a flexible reasoner over text rather than a schema-strict client — interpreted the absence of the old token as "not available," then phrased that finding into helpful, confident, wrong customer messages. The contract regression was absorbed on the consumer side, where no test suite was watching.

This is the failure mode that conventional API hygiene was never designed to catch. Strict clients break loudly. Agents break quietly. And the longer you treat your agent like a normal HTTP consumer, the longer this class of bug hides inside metrics that look fine.
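One way to make the agent break loudly is to validate tool responses against a pinned contract before the model sees them. The sketch below is illustrative only: the `out_of_stock` value, the `ToolContractError` class, and `validate_search_result` are assumptions, since the excerpt names only the `in_stock` to `available` rename.

```python
# Assumed pre-bump contract; only "in_stock" is named in the post.
EXPECTED_STATUS_VALUES = {"in_stock", "out_of_stock"}


class ToolContractError(Exception):
    """Raised when a tool response no longer matches the pinned contract."""


def validate_search_result(item: dict) -> dict:
    """Pass the item through unchanged, or fail if the status value is unknown."""
    status = item.get("status")
    if status not in EXPECTED_STATUS_VALUES:
        raise ToolContractError(f"unexpected status value: {status!r}")
    return item
```

Under this check, a response carrying `available` raises on the first call after the bump instead of being reinterpreted by the model.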

The Account Number Your LLM Could Not Actually Copy

· 10 min read
Tian Pan
Software Engineer

A support agent reads a customer ticket, pulls up the account, summarizes the recent activity, and issues a refund. The refund lands in the wrong account. Not a fabricated account — a real one, one digit off. The model wrote acct_7H9j2 when the customer's record was acct_7H9j3. The trace looks clean: a search call returned the right record, a summarize call produced the right summary, a refund call ran without error. Every step succeeded. The wrong customer got the money.

This is not a hallucination in the sense the postmortem will use. The model did not invent a customer. It got one character of an existing identifier wrong, and that is a different failure mode, one your eval suite probably never caught, because the synthetic identifiers in your test fixtures were unique by construction. Put two account numbers with a long shared prefix in the same context, and the language model, which is a token predictor that has never been trained to copy random strings with fidelity, picked the wrong one.

The lesson is structural, not behavioral. The model does not have an attention mechanism that special-cases identifiers. To the model, acct_7H9j2 is a sequence of subword tokens whose continuation probability shifts with every other token in the window. If a near-twin identifier appears in the same prompt, the model is one bad sample away from a quiet substitution that the harness will happily execute.
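A structural guard can live in the harness: compare the identifier the model wrote against the one the ticket is bound to before the refund runs. This is a minimal sketch; `check_refund_target`, the `account_id` argument name, and the idea that the harness already knows the ticket's account are all assumptions.

```python
class IdentifierMismatch(Exception):
    """Raised when a tool argument names a different record than the ticket."""


def check_refund_target(tool_args: dict, ticket_account_id: str) -> None:
    """Compare the model-written ID with the ID the harness loaded itself."""
    written = tool_args["account_id"]
    if written != ticket_account_id:
        raise IdentifierMismatch(
            f"refund targets {written!r} but the ticket belongs to {ticket_account_id!r}"
        )
```

The comparison is exact string equality on purpose. A near match such as `acct_7H9j2` against `acct_7H9j3` is precisely the case it has to reject.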

The Backfill Problem: Why Agent Memory Needs Migrations Like a Database

· 11 min read
Tian Pan
Software Engineer

You shipped a better memory format on a Tuesday. The new schema splits a freeform summary string into structured fields — entities, preferences, last_verified_at — because the old blob was hard to retrieve against and impossible to update cleanly. The change is obviously correct. It passes review. It ships.

What you did not notice is that every memory written before Tuesday is now subtly wrong. Some records still have the old summary field and no entities, so the retrieval code that now keys on entities skips them. A few have a summary that the new parser interprets as an empty preference set. The agent didn't crash. It just quietly forgot a year of accumulated context, and nobody filed a bug because nothing looked broken — the agent still answered, just worse.
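The database analogy suggests a migration function that upgrades old records on read or in a batch job. The sketch below assumes `entities` is a list and `preferences` is a mapping; `legacy_summary` and `needs_backfill` are hypothetical fields added so that old text is kept and flagged instead of being parsed as empty.

```python
def migrate_memory(record: dict) -> dict:
    """Upgrade a pre-change record to the structured shape, keeping the old text."""
    if "entities" in record:
        return record  # already in the new format
    return {
        "entities": [],
        "preferences": {},
        "last_verified_at": None,
        "legacy_summary": record.get("summary", ""),  # hypothetical field
        "needs_backfill": True,  # hypothetical flag for a later extraction job
    }
```

Retrieval code can then treat `needs_backfill` records explicitly, for example by falling back to `legacy_summary`, so they are not skipped.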

The Degradation Signals Your Agent Never Receives

· 9 min read
Tian Pan
Software Engineer

When a downstream API starts to wobble, a human operator finds out a dozen ways before anything actually breaks. The status page flips to yellow. A changelog email lands in the inbox. A warning banner appears in the provider's dashboard. The on-call channel lights up with a 429 someone spotted in the logs. A teammate posts "anyone else seeing slow writes?" None of these are responses to a request. They are the ambient operational signal that surrounds the API, and a human absorbs it almost passively.

An agent calling the same API receives exactly one thing: the response to the request it just made. Status code, headers, body. That is the entire channel. It has no inbox, no dashboard, no Slack, no peripheral vision. It cannot notice that the last ten calls each took twice as long as the ten before. It cannot read the status page, because nobody handed it the URL and it has no standing instruction to look. When the dependency degrades, the agent is the last party in the system to find out — and it usually finds out by failing.

This asymmetry is not a model capability problem. A smarter model does not fix it. The agent is blind to operational signals because the plumbing never delivers them, and most agent stacks ship without anyone noticing the plumbing is missing.
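Part of that plumbing can be small. The sketch below tracks call latencies in the harness and produces a note the agent's context could include when the last ten calls are at least twice as slow as the ten before. The `LatencyWatch` class, the window size, and the ratio are illustrative assumptions.

```python
from collections import deque
from statistics import mean
from typing import Optional


class LatencyWatch:
    """Compare the last `window` call latencies with the `window` before them."""

    def __init__(self, window: int = 10, ratio: float = 2.0) -> None:
        self.window = window
        self.ratio = ratio
        self.samples = deque(maxlen=2 * window)  # oldest samples fall off

    def record(self, seconds: float) -> None:
        self.samples.append(seconds)

    def degradation_note(self) -> Optional[str]:
        """Return a note for the agent's context, or None if nothing stands out."""
        if len(self.samples) < 2 * self.window:
            return None
        values = list(self.samples)
        older = mean(values[: self.window])
        recent = mean(values[self.window :])
        if older > 0 and recent >= self.ratio * older:
            return f"Tool latency rose from {older:.2f}s to {recent:.2f}s on average."
        return None
```

The harness calls `record` after every tool call and, when `degradation_note` returns a string, attaches it to the next tool result or system message so the agent can slow down, retry less aggressively, or tell the user.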

The Eval Suite That Became the Spec Nobody Agreed To

· 8 min read
Tian Pan
Software Engineer

Open any mature agent codebase and ask a simple question: where is the requirements document? Not the pitch deck, not the launch doc, not the Notion page that was last touched in Q3. Where is the artifact that says, concretely and unambiguously, what this agent is supposed to do?

For most teams, the honest answer is the eval suite. There is a folder of test cases — inputs paired with expected outputs, rubrics, judge prompts — and a CI gate that says pass or fail. That folder is the only place where "correct" is defined precisely enough to be executed. Everything else is prose, and prose drifts.

This is not inherently bad. An executable spec is more honest than a PRD that nobody reads. The problem is that almost nobody treats the eval suite as a spec. It was assembled by one engineer, under deadline, to make a release gate go green. It encodes a hundred judgment calls that were never written down, never reviewed, and never agreed to. And the model is now optimized precisely to it.

You Can't Email a Changelog to a Model: Why API Deprecation Breaks When the Caller Is an LLM

· 10 min read
Tian Pan
Software Engineer

API deprecation is a communication protocol that assumes the receiver can read. You publish a changelog, send an email to registered developers, add a Deprecation header, give six months of notice, and trust that a human on the other end will see the warning, file a ticket, and migrate before the sunset date. That entire workflow quietly stopped working the moment your most active caller became a language model.

An LLM does not subscribe to your developer newsletter. It does not have a Slack channel where someone pastes your migration guide. It rediscovers your API on every single call — from a tool description it was handed, a documentation page that may be eighteen months stale, or a memory of how your API looked in its training data. There is no persistent client you can version, notify, or page. Each request is a fresh negotiation with an entity that has no memory of your last announcement and no obligation to read your next one.

This is not a hypothetical. As agents become the dominant consumers of internal and external APIs, the deprecation playbook every backend team has used for fifteen years is failing in a specific, diagnosable way — and most teams discover it only when a "deprecated for six months" endpoint is still serving an agent in production with no path to make it stop.
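If the caller only reads what the tool wrapper hands it, the wrapper is where a deprecation signal has to be made visible. A minimal sketch, assuming the API sends a `Deprecation` header and possibly a `Sunset` header; `annotate_tool_result` and the `_api_notice` key are hypothetical names.

```python
def annotate_tool_result(body: dict, response_headers: dict[str, str]) -> dict:
    """Copy deprecation signals from HTTP headers into what the model reads."""
    headers = {name.lower(): value for name, value in response_headers.items()}
    notice = {
        key: headers[key] for key in ("deprecation", "sunset") if key in headers
    }
    if not notice:
        return body
    return {**body, "_api_notice": notice}
```

The same notice can also be logged for the humans who own the tool description, since they are the ones who can actually migrate the agent.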

The Prompt Surface Area Problem: Why Adding a Tool Is Never Just Adding a Tool

· 10 min read
Tian Pan
Software Engineer

Every engineer who has shipped an LLM-powered agent has been tempted by a simple mental model: a tool is a function. Adding a tool means the agent can do one more thing. The cost is a few lines of documentation in the system prompt, maybe a schema definition, maybe one new entry in a tool registry. It feels additive — linear.

It isn't. Each new tool doesn't expand what the agent can do in isolation; it expands what the agent can do in combination with every tool already there. That distinction is the source of a class of production failures that no amount of prompt tweaking can fix after the fact, because the problem is architectural. The prompt surface area problem is real, it compounds quickly, and most teams don't see it until they're already deep in it.
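A rough way to see the non-linearity is to count only two-step sequences of distinct tools. This is an illustrative simplification, not the post's own metric: real agents chain more than two calls, so the true growth is steeper.

```python
from math import perm


def ordered_tool_pairs(tool_count: int) -> int:
    """Number of distinct two-step sequences that use two different tools."""
    return perm(tool_count, 2)


def pairs_added_by_next_tool(tool_count: int) -> int:
    """New two-step sequences created by adding one tool to `tool_count` tools."""
    return ordered_tool_pairs(tool_count + 1) - ordered_tool_pairs(tool_count)
```

With ten tools there are 90 such sequences, and the eleventh tool adds 20 more, even though it is only one new entry in the registry.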

Agent Blast Radius: Bounding Worst-Case Impact Before Your Agent Misfires in Production

· 10 min read
Tian Pan
Software Engineer

Nine seconds. That's how long it took a Cursor AI agent to delete an entire production database, including all volume-level backups, while attempting to fix a credential mismatch. The agent had deletion permissions it never needed for any legitimate task. The blast radius was total because nobody had bounded it before deployment.

This isn't a story about model failure. It's a story about permission scope. The model did exactly what it calculated it should do. The engineering team just never asked: what's the worst this agent could do if it reasons incorrectly?

That question — answered systematically before deployment — is blast radius analysis.
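A first pass at answering the question can be mechanical: list what each agent is granted, mark which actions are destructive, and look at the intersection. The agent name, action names, and both functions below are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical agent and action names, for illustration only.
GRANTED = {
    "credential_fixer": {"read_config", "rotate_credential", "delete_database"},
}
DESTRUCTIVE = {"delete_database", "delete_backups"}


def blast_radius(agent: str) -> list[str]:
    """Destructive actions the agent could take if it reasons incorrectly."""
    return sorted(GRANTED.get(agent, set()) & DESTRUCTIVE)


def authorize(agent: str, action: str) -> None:
    """Refuse any action outside the agent's grant."""
    if action not in GRANTED.get(agent, set()):
        raise PermissionError(f"{agent} is not allowed to {action}")
```

A non-empty `blast_radius` for an agent whose task never requires deletion is the finding: remove the grant before deployment, so `authorize` refuses the call no matter how the model reasons.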

Tool Output Schema Design: How Your Tool Responses Shape Agent Reasoning

· 9 min read
Tian Pan
Software Engineer

Most teams designing LLM agents spend considerable effort on tool selection and system prompt wording. Almost none of them think carefully about what their tools return. That's a mistake with compounding consequences — because the shape of a tool response determines how well the agent can reason about it, how much context window it consumes, and how often it hallucinates an interpretation the tool never intended.

Tool output schema design is infrastructure, not plumbing. Get it wrong and your agent will fail in ways that look like reasoning problems when they're actually schema problems.
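As a small illustration of shaping a response, the sketch below turns a raw row dump into a compact result that states its own truncation. The function name, the `id` and `title` fields, and the limit are assumptions made for the example.

```python
def shape_search_output(rows: list[dict], limit: int = 5) -> dict:
    """Return a compact, explicit result instead of the raw row dump."""
    return {
        "total_matches": len(rows),
        "returned": min(len(rows), limit),
        "truncated": len(rows) > limit,
        "results": [{"id": r["id"], "title": r["title"]} for r in rows[:limit]],
    }
```

Stating `truncated` and `total_matches` explicitly means the agent does not have to infer from a short list whether more results exist.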